I used Anki myself with Algo Deck and Design Deck, and it paid off. This method played a key role in helping me land a role as an L5 SWE (senior software engineer) at Google.
Here is a flashcard example:
The Anki versions (a clone of the flashcards from this repo) are available via one-time GitHub sponsorships:
Check the Anki version here.
"},{"location":"#array","title":"Array","text":""},{"location":"#algorithm-to-reverse-an-array","title":"Algorithm to reverse an array","text":"int i = 0;\nint j = a.length - 1;\nwhile (i < j) {\n swap(a, i++, j--);\n}\n
"},{"location":"#array-complexity-access-search-insert-delete","title":"Array complexity: access, search, insert, delete","text":"Access: O(1)
Search: O(n)
Insert: O(n)
Delete: O(n)
"},{"location":"#binary-search-in-a-sorted-array-algorithm","title":"Binary search in a sorted array algorithm","text":"int lo = 0, hi = a.length - 1;\n\nwhile (lo <= hi) {\n int mid = lo + ((hi - lo) / 2);\n if (a[mid] == key) {\n return mid;\n }\n if (a[mid] < key) {\n lo = mid + 1;\n } else {\n hi = mid - 1;\n }\n}\n
"},{"location":"#further-reading","title":"Further Reading","text":"Solution: binary search
Check first if the array is rotated. If not, apply normal binary search
If rotated, find pivot (smallest element, only element whose previous is bigger)
Then, check if the element is in 0..pivot-1 or pivot..len-1
int findElementRotatedArray(int[] a, int val) {\n // If array not rotated\n if (a[0] < a[a.length - 1]) {\n // We apply the normal binary search\n return binarySearch(a, val, 0, a.length - 1);\n }\n\n int pivot = findPivot(a);\n\n if (val >= a[0] && val <= a[pivot - 1]) {\n // Element is before the pivot\n return binarySearch(a, val, 0, pivot - 1);\n } else if (val >= a[pivot] && val <= a[a.length - 1]) {\n // Element is after the pivot\n return binarySearch(a, val, pivot, a.length - 1);\n }\n return -1;\n}\n
"},{"location":"#given-an-array-move-all-the-0-to-the-left-while-maintaining-the-order-of-the-other-elements","title":"Given an array, move all the 0 to the left while maintaining the order of the other elements","text":"Example: 1, 0, 2, 0, 3, 0 => 0, 0, 0, 1, 2, 3
Two pointers technique: read and write starting at the end of the array
If read is on a 0, decrement read. Otherwise swap, decrement both
public void move(int[] a) {\n int w = a.length - 1, r = a.length - 1;\n while (r >= 0) {\n if (a[r] == 0) {\n r--;\n } else {\n swap(a, r--, w--);\n }\n }\n}\n
Time complexity: O(n)
Space complexity: O(1)
"},{"location":"#how-to-detect-if-an-element-is-a-pivot-in-a-rotated-sorted-array","title":"How to detect if an element is a pivot in a rotated sorted array","text":"Only element whose previous is bigger (also the pivot is the smallest element)
"},{"location":"#how-to-find-a-pivot-element-in-a-rotated-array","title":"How to find a pivot element in a rotated array","text":"Check first if the array is rotated
Then, apply binary search (comparison with a[right] to know if we go left or right)
int findPivot(int[] a) {\n int left = 0, right = a.length - 1;\n\n // Array is not rotated\n if (a[left] < a[right]) {\n return -1;\n }\n\n while (left <= right) {\n int mid = left + ((right - left) / 2);\n if (mid > 0 && a[mid] < a[mid - 1]) {\n // Return the index of the pivot (smallest element)\n return mid;\n }\n\n if (a[mid] < a[right]) {\n // Pivot is on the left\n right = mid - 1;\n } else {\n // Pivot is on the right\n left = mid + 1;\n }\n }\n\n return -1;\n}\n
"},{"location":"#how-to-find-the-duplicates-in-an-array","title":"How to find the duplicates in an array","text":"When full, create a new array of twice the size, copy items (System.arraycopy is optimized for that)
Shrink: - Not when one-half full (otherwise the worst case is too expensive: alternating add/remove at the threshold triggers double-shrink-double-shrink, each in O(n)) - Solution: shrink when one-quarter full (a minimal sketch follows)
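A minimal sketch of this grow/shrink policy, assuming a simplified int-based structure (class and method names are illustrative):
class DynamicArray {\n private int[] a = new int[1];\n private int size = 0;\n\n public void add(int value) {\n // Full: double the capacity\n if (size == a.length) {\n resize(2 * a.length);\n }\n a[size++] = value;\n }\n\n public int removeLast() {\n int value = a[--size];\n // One-quarter full: halve the capacity\n if (size > 0 && size == a.length / 4) {\n resize(a.length / 2);\n }\n return value;\n }\n\n private void resize(int capacity) {\n int[] copy = new int[capacity];\n System.arraycopy(a, 0, copy, 0, size);\n a = copy;\n }\n}\n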
"},{"location":"#how-to-test-if-the-array-is-sorted-in-ascending-or-descending-order","title":"How to test if the array is sorted in ascending or descending order","text":"Test first and last element (no iteration)
"},{"location":"#rotate-an-array-by-n-elements-n-can-be-negative","title":"Rotate an array by n elements (n can be negative)","text":"Example: 1, 2, 3, 4, 5 with n = 3 => 3, 4, 5, 1, 2
void rotateArray(List<Integer> a, int n) {\n if (n < 0) {\n n = a.size() + n;\n }\n\n reverse(a, 0, a.size() - 1);\n reverse(a, 0, n - 1);\n reverse(a, n, a.size() - 1);\n}\n
Time complexity: O(n)
Memory complexity: O(1)
"},{"location":"#bit","title":"Bit","text":""},{"location":"#operator","title":"& operator","text":"AND bit by bit
"},{"location":"#operator_1","title":"<< operator","text":"Shift on the left
n * 2 <=> left shift by 1
n * 4 <=> left shift by 2
"},{"location":"#operator_2","title":">> operator","text":"Shift on the right
"},{"location":"#operator_3","title":">>> operator","text":"Logical shift (shift the sign bit as well)
"},{"location":"#operator_4","title":"^ operator","text":"XOR bit by bit
"},{"location":"#bit-vector-structure","title":"Bit vector structure","text":"Vector (linear sequence of numeric values stored contiguously in memory) in which each element is a bit (so either 0 or 1)
"},{"location":"#check-exactly-one-bit-is-set","title":"Check exactly one bit is set","text":"boolean checkExactlyOneBitSet(int num) {\n return num != 0 && (num & (num - 1)) == 0;\n}\n
"},{"location":"#clear-bits-from-i-to-0","title":"Clear bits from i to 0","text":"int clearBitsFromITo0(int num, int i) {\n int mask = (-1 << (i + 1));\n return num & mask;\n}\n
"},{"location":"#clear-bits-from-most-significant-one-to-i","title":"Clear bits from most significant one to i","text":"int clearBitsFromMsbToI(int num, int i) {\n int mask = (1 << i) - 1;\n return num & mask;\n}\n
"},{"location":"#clear-ith-bit","title":"Clear ith bit","text":"int clearBit(final int num, final int i) {\n final int mask = ~(1 << i);\n return num & mask;\n}\n
"},{"location":"#flip-ith-bit","title":"Flip ith bit","text":"int flipBit(final int num, final int i) {\n return num ^ (1 << i);\n}\n
"},{"location":"#get-ith-bit","title":"Get ith bit","text":"boolean getBit(final int num, final int i) {\n return ((num & (1 << i)) != 0);\n}\n
"},{"location":"#how-to-flip-one-bit","title":"How to flip one bit","text":"b ^ 1
"},{"location":"#how-to-represent-signed-integers","title":"How to represent signed integers","text":"Use the most significative bit to represent the sign. Yet, it is not enough (problem with this technique: 5 + (-5) != 0)
Two's complement technique: take the one complement and add one
-3: 1101
-2: 1110
-1: 1111
0: 0000
1: 0001
2: 0010
3: 0011
The most significant bit still represents the sign
Max integer value: 1...1 (31 bits)
-1: 1...1 (32 bits)
"},{"location":"#set-ith-bit","title":"Set ith bit","text":"int setBit(final int num, final int i) {\n return num | (1 << i);\n}\n
"},{"location":"#update-a-bit-from-a-given-value","title":"Update a bit from a given value","text":"int updateBit(int num, int i, boolean bit) {\n int value = bit ? 1 : 0;\n int mask = ~(1 << i);\n return (num & mask) | (value << i);\n}\n
"},{"location":"#x-0s","title":"x & 0s","text":"0
"},{"location":"#x-1s","title":"x & 1s","text":"x
"},{"location":"#x-x","title":"x & x","text":"x
"},{"location":"#x-0s_1","title":"x ^ 0s","text":"x
"},{"location":"#x-1s_1","title":"x ^ 1s","text":"~x
"},{"location":"#x-x_1","title":"x ^ x","text":"0
"},{"location":"#x-0s_2","title":"x | 0s","text":"x
"},{"location":"#x-1s_2","title":"x | 1s","text":"1s
"},{"location":"#x-x_2","title":"x | x","text":"x
"},{"location":"#xor-operations","title":"XOR operations","text":"0 ^ 0 = 0
1 ^ 0 = 1
0 ^ 1 = 1
1 ^ 1 = 0
n XOR 0 => keep
n XOR 1 => flip
"},{"location":"#operator_5","title":"| operator","text":"OR bit by bit
"},{"location":"#operator_6","title":"~ operator","text":"Complement bit by bit
"},{"location":"#complexity","title":"Complexity","text":"Big-O Cheat Sheet
"},{"location":"#01-knapsack-brute-force-complexity","title":"0/1 Knapsack brute force complexity","text":"Time complexity: O(2^n) with n the number of items
Space complexity: O(n)
"},{"location":"#01-knapsack-memoization-complexity","title":"0/1 Knapsack memoization complexity","text":"Time and space complexity: O(n * c) with n the number items and c the capacity
"},{"location":"#01-knapsack-tabulation-complexity","title":"0/1 Knapsack tabulation complexity","text":"Time and space complexity: O(n * c) with n the number of items and c the capacity
Space complexity could even be improved to O(2*c) = O(c) as we need to store only the last 2 rows (using row % 2):
int[][] dp = new int[2][c + 1];\n
"},{"location":"#amortized-complexity-definition","title":"Amortized complexity definition","text":"How much of a resource (time or memory) it takes to execute per operation on average
"},{"location":"#array-complexity-access-search-insert-delete_1","title":"Array complexity: access, search, insert, delete","text":"Access: O(1)
Search: O(n)
Insert: O(n)
Delete: O(n)
"},{"location":"#b-tree-complexity-access-insert-delete","title":"B-tree complexity: access, insert, delete","text":"All: O(log n)
"},{"location":"#bfs-and-dfs-graph-traversal-time-and-space-complexity","title":"BFS and DFS graph traversal time and space complexity","text":"Time: O(v + e) with v the number of vertices and e the number of edges
Space: O(v)
"},{"location":"#bfs-and-dfs-tree-traversal-time-and-space-complexity","title":"BFS and DFS tree traversal time and space complexity","text":"BFS: time O(v), space O(v)
DFS: time O(v), space O(h) (height of the tree)
"},{"location":"#big-o","title":"Big O","text":"Upper bound
"},{"location":"#big-omega","title":"Big Omega","text":"Lower bound (fastest)
"},{"location":"#big-theta","title":"Big Theta","text":"Theta(n) if both O(n) and Omega(n)
"},{"location":"#binary-heap-min-heap-or-max-heap-complexity-insert-get-min-max-delete-min-max","title":"Binary heap (min-heap or max-heap) complexity: insert, get min (max), delete min (max)","text":"Insert: O(log (n))
Get min (max): O(1)
Delete min: O(log n)
If not balanced O(n)
If balanced O(log n)
"},{"location":"#bst-delete-algo-and-complexity","title":"BST delete algo and complexity","text":"Find inorder successor and swap it
Average: O(log n)
Worst: O(h) if not self-balanced BST, otherwise O(log n)
"},{"location":"#bubble-sort-complexity-and-stability","title":"Bubble sort complexity and stability","text":"Time: O(n\u00b2)
Space: O(1)
Stable
"},{"location":"#complexity-of-a-function-making-multiple-recursive-subcalls","title":"Complexity of a function making multiple recursive subcalls","text":"Time: O(branches^depth) with branches the number of times each recursive call branches (english: 2 power 3)
Space: O(depth) to store the call stack
"},{"location":"#complexity-to-create-a-trie","title":"Complexity to create a trie","text":"Time and space: O(n * l) with n the number of words and l the longest word length
"},{"location":"#complexity-to-insert-a-key-in-a-trie","title":"Complexity to insert a key in a trie","text":"Time: O(k) with k the size of the key
Space: O(1) iterative, O(k) recursive
"},{"location":"#complexity-to-search-for-a-key-in-a-trie","title":"Complexity to search for a key in a trie","text":"Time: O(k) with k the size of the key
Space: O(1) iterative or O(k) recursive
"},{"location":"#counting-sort-complexity-stability-use-case","title":"Counting sort complexity, stability, use case","text":"Time complexity: O(n + k) // n is the number of elements, k is the range (the maximum element)
Space complexity: O(k)
Stable
Use case: known and small range of possible integers
"},{"location":"#doubly-linked-list-complexity-access-insert-delete","title":"Doubly linked list complexity: access, insert, delete","text":"Access: O(n)
Insert: O(1)
Delete: O(1)
"},{"location":"#hash-table-complexity-search-insert-delete","title":"Hash table complexity: search, insert, delete","text":"All: amortized O(1), worst O(n)
"},{"location":"#heapsort-complexity-stability-use-case","title":"Heapsort complexity, stability, use case","text":"Time: Theta(n log n)
Space: O(1)
Unstable
Use case: space constrained environment with O(n log n) time guarantee
Yet, not stable and not cache friendly
"},{"location":"#insertion-sort-complexity-stability-use-case","title":"Insertion sort complexity, stability, use case","text":"Time: O(n\u00b2)
Space: O(1)
Stable
Use case: partially sorted structure
"},{"location":"#linked-list-complexity-access-insert-delete","title":"Linked list complexity: access, insert, delete","text":"Access: O(n)
Insert: O(1)
Delete: O(1)
"},{"location":"#mergesort-complexity-stability-use-case","title":"Mergesort complexity, stability, use case","text":"Time: Theta(n log n)
Space: O(n)
Stable
Use case: good worst case time complexity and stable, good with linked list
"},{"location":"#quicksort-complexity-stability-use-case","title":"Quicksort complexity, stability, use case","text":"Time: best and average O(n log n), worst O(n\u00b2) if the array is already sorted in ascending or descending order
Space: O(log n) // In-place sorting algorithm
Not stable
Use case: in practice, quicksort is often faster than merge sort due to better locality (not applicable with linked list so in this case we prefer mergesort)
"},{"location":"#radix-sort-complexity-stability-use-case","title":"Radix sort complexity, stability, use case","text":"Time complexity: O(nk) // n is the number of elements, k is the maximum number of digits for a number
Space complexity: O(k)
Stable
Use case: if k < log(n) (for example 1M of elements from 0..1000 as 4 < log(1M))
"},{"location":"#recursivity-impacts-on-algorithm-complexity","title":"Recursivity impacts on algorithm complexity","text":"Space impact as each call is added to the call stack
Unless we use tail call recursion
"},{"location":"#red-black-tree-complexity-access-insert-delete","title":"Red-black tree complexity: access, insert, delete","text":"All: O(log n)
"},{"location":"#selection-sort-complexity","title":"Selection sort complexity","text":"Time: Theta(n\u00b2)
Space: O(1)
"},{"location":"#stack-implementations-and-insertdelete-complexity","title":"Stack implementations and insert/delete complexity","text":"Insert: O(1)
Delete: O(1)
Insert: O(n), amortized time O(1)
Delete: O(1)
"},{"location":"#time-complexity-to-build-a-binary-heap","title":"Time complexity to build a binary heap","text":"O(n)
Time and space: O(v + e)
"},{"location":"#dynamic-programming","title":"Dynamic Programming","text":""},{"location":"#dynamic-programming-concept","title":"Dynamic programming concept","text":"Break down a problem in smaller parts and store the results of these subproblems so that they only need to be computed once
A DP algorithm will search through all of the possible subproblems (main difference with greedy algorithms)
Based on either: - Memoization (top-down) - Tabulation (bottom-up)
"},{"location":"#memoization-vs-tabulation","title":"Memoization vs tabulation","text":"Optimization technique to cache previously computed results
Used by dynamic programming algorithms
Memoization: top-down (start with a large, complex problem and break it down into smaller sub-problems)
f(x) {\n if (mem[x] is undefined)\n mem[x] = f(x-1) + f(x-2)\n return mem[x]\n}\n
Tabulation: bottom-up (start with the smallest solution and then build up each solution until we arrive at the solution to the initial problem)
tabFib(n) {\n mem[0] = 0\n mem[1] = 1\n for i = 2...n\n mem[i] = mem[i-2] + mem[i-1]\n return mem[n]\n}\n
"},{"location":"#encoding","title":"Encoding","text":""},{"location":"#ascii-charset","title":"ASCII charset","text":"128 characters
"},{"location":"#difference-encodingcharset","title":"Difference encoding/charset","text":"Charset: set of characters to be used (e.g. ASCII 128 characters)
Encoding: translation of a list of characters in binary
Encoding is needed because we can't guarantee that 1 character = 1 byte for every charset
Example: UTF-8 encodes Unicode characters using from 1 byte (English) up to 4 bytes
"},{"location":"#unicode-charset","title":"Unicode charset","text":"Superset of ASCII with 2^21 characters
"},{"location":"#general","title":"General","text":""},{"location":"#before-finding-a-solution","title":"Before finding a solution","text":"1) Make sure to understand the problem by listing: - Inputs - Outputs (what do we search) - Constraints
2) Draw examples
"},{"location":"#comparator-implementation-to-order-two-integers","title":"Comparator implementation to order two integers","text":"Ordering, min-heap: (a, b) -> a - b
Reverse ordering, max-heap: (a, b) -> b - a
7 ways: 1. a and b do not overlap 2. a and b overlap, b ends after a 3. a completely overlaps b 4. a and b overlap, a ends after b 5. b completely overlaps a 6. a and b do not overlap 7. a and b are equal
"},{"location":"#different-ways-for-two-intervals-to-relate-to-each-other-if-ordered-by-start-then-end","title":"Different ways for two intervals to relate to each other if ordered by start then end","text":"2 different ways: - No overlap - Overlap // Merge intervals (start of the first interval, max of the two ends)
"},{"location":"#divide-and-conquer-algorithm-paradigm","title":"Divide and conquer algorithm paradigm","text":"Example with merge sort: 1. Split the array into two halves 2. Sort them (recursive call) 3. Merge the two halves
"},{"location":"#how-to-name-a-matrix-indexes","title":"How to name a matrix indexes","text":"Use m[row][col] instead of m[y][x]
"},{"location":"#if-stucked-on-a-problem","title":"If stucked on a problem","text":"Mutates an input
"},{"location":"#p-vs-np-problems","title":"P vs NP problems","text":"P (polynomial): set of problems that can be solved reasonably fast (example: multiplication, sorting, etc.)
Complexity is not exponential
NP (non-deterministic polynomial): set of problems where, given a solution, we can test if it is a correct one in a reasonable amount of time, but finding the solution is not fast (example: a 1M*1M sudoku grid, the travelling salesman problem, etc.)
NP-complete: hardest problems in the NP set
There are other sets of problems that are neither P nor NP as an answer is really hard to prove (example: best move in a chess game)
P = NP asks: does being able to quickly recognize correct answers mean there's also a quick way to find them?
"},{"location":"#solving-optimization-problems","title":"Solving optimization problems","text":"Preserve the original order of elements with equal key
"},{"location":"#what-do-to-after-having-designed-a-solution","title":"What do to after having designed a solution","text":"Testing on nominal cases then edge cases
Time and space complexity
"},{"location":"#graph","title":"Graph","text":""},{"location":"#a-algorithm","title":"A* algorithm","text":"Complete solution to find the shortest path to a target node
Algorithm: - Put initial state in a priority queue - While the priority queue is not empty: poll an element and insert all its neighbours - If the target is reached, update a min variable
Priority is computed using the evaluation function: f(n) = h + g where h is a heuristic (local cost to visit a node) and g is the cost so far (length of the path so far)
"},{"location":"#backedge-definition","title":"Backedge definition","text":"An edge from a node to itself or to an ancestor
"},{"location":"#best-first-search-algorithm","title":"Best-first search algorithm","text":"Greedy solution (non-complete) to find the shortest path to a target node
Algorithm: - Put initial state in a priority queue - While target not reached: poll an element and insert all its neighbours
Priority is computed using the evaluation function: f(n) = h where h is a heuristic (local cost to visit a node)
"},{"location":"#bfs-dfs-graph-traversal-use-cases","title":"BFS & DFS graph traversal use cases","text":"BFS: shortest path
DFS: does a path exist, does a cycle exist (memo: D for Does)
DFS stores a single path at a time and usually requires less memory than BFS (on average; the worst-case space complexity is the same)
"},{"location":"#bfs-and-dfs-graph-traversal-time-and-space-complexity_1","title":"BFS and DFS graph traversal time and space complexity","text":"Time: O(v + e) with v the number of vertices and e the number of edges
Space: O(v)
"},{"location":"#bidirectional-search","title":"Bidirectional search","text":"Run two simultaneous BFS, one from the source, one from the target
Once their searches collide, we found a path
If the branching factor of a tree is b and the distance to the target vertex is d, then the normal BFS/DFS searching time complexity would be O(b^d)
Here it is O(b^(d/2))
"},{"location":"#connected-graph-definition","title":"Connected graph definition","text":"If there is a path between every pair of vertices, the graph is called connected
Otherwise, the graph consists of multiple isolated subgraphs
"},{"location":"#difference-best-first-search-and-a-algorithms","title":"Difference Best-first search and A* algorithms","text":"Best-first search is a greedy solution: not complete // a solution can be not optimal
A*: complete
"},{"location":"#dijkstra-algorithm","title":"Dijkstra algorithm","text":"Input: graph, initial vertex
Output: for each vertex: shortest path and previous node // The previous node is the one we are coming from in the shortest path. To find the shortest path between two nodes, we need to iterate backwards. Example: A -> C => E, D, A
Algorithm: - Init the shortest distance to MAX except for the initial node - Init a priority queue where the comparator is on the total distance so far - Init a set to store all visited nodes - Add the initial vertex to the priority queue - While the queue is not empty: poll a vertex (mark it visited) and check the total distance to each neighbour (distance so far + edge distance); update the shortest and previous arrays if smaller. If the destination is unvisited, add it to the queue
void dijkstra(GraphAjdacencyMatrix graph, int initial) {\n Set<Integer> visited = new HashSet<>();\n\n int n = graph.vertex;\n int[] shortest = new int[n];\n int[] previous = new int[n];\n for (int i = 0; i < n; i++) {\n if (i != initial) {\n shortest[i] = Integer.MAX_VALUE;\n }\n }\n\n // Entry: key=vertex, value=distance so far\n PriorityQueue<Entry> minHeap = new PriorityQueue<>((e1, e2) -> e1.value - e2.value);\n minHeap.add(new Entry(initial, 0));\n\n while (!minHeap.isEmpty()) {\n Entry current = minHeap.poll();\n int source = current.key;\n int distanceSoFar = current.value;\n\n // Get neighbours\n List<GraphAjdacencyMatrix.Edge> edges = graph.getEdge(source);\n\n for (GraphAjdacencyMatrix.Edge edge : edges) {\n // For each neighbour, check the total distance\n int distance = distanceSoFar + edge.distance;\n if (distance < shortest[edge.destination]) {\n shortest[edge.destination] = distance;\n previous[edge.destination] = source;\n }\n\n // Add the element in the queue if not visited\n if (!visited.contains(edge.destination)) {\n minHeap.add(new Entry(edge.destination, distance));\n }\n }\n\n visited.add(source);\n }\n\n print(shortest);\n print(previous);\n}\n
"},{"location":"#dynamic-connectivity-problem","title":"Dynamic connectivity problem","text":"Given a set of nodes and edges: are two nodes connected (directly or in-directly)?
Two methods: - union(2, 5) // connect object 2 with object 5 - connected(1 , 6) // is object 1 connected to object 6?
"},{"location":"#further-reading_1","title":"Further Reading","text":"Array of integer of size N initialized with their index (0: 0, 1: 1 etc.).
If two indexes have the same value, they belong to the same group.
Init: integer array of size N
Interpretation: id[i] is parent of i, root parent if id[i] == i
Modify quick-union to avoid tall trees
Keep track of the size of each tree (number of nodes): extra array size[i] to count number of objects in the tree rooted at i
O(n) extra space
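A possible weighted quick-union implementation based on the notes above (a sketch, one variant among others):
class WeightedQuickUnion {\n private final int[] id; // id[i] is the parent of i, root if id[i] == i\n private final int[] size; // Number of objects in the tree rooted at i\n\n WeightedQuickUnion(int n) {\n id = new int[n];\n size = new int[n];\n for (int i = 0; i < n; i++) {\n id[i] = i;\n size[i] = 1;\n }\n }\n\n private int root(int i) {\n while (id[i] != i) {\n i = id[i];\n }\n return i;\n }\n\n public boolean connected(int p, int q) {\n return root(p) == root(q);\n }\n\n public void union(int p, int q) {\n int rootP = root(p), rootQ = root(q);\n if (rootP == rootQ) {\n return;\n }\n // Link the smaller tree below the bigger one to avoid tall trees\n if (size[rootP] < size[rootQ]) {\n id[rootP] = rootQ;\n size[rootQ] += size[rootP];\n } else {\n id[rootQ] = rootP;\n size[rootP] += size[rootQ];\n }\n }\n}\n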
Solution: topological sort
If there's a cycle in the relations, it means it is not possible to schedule all the tasks
There is a cycle if the produced sorted array size is different from n
"},{"location":"#graph-definition","title":"Graph definition","text":"A way to represent a network, or a collection of inteconnected objects
G = (V, E) with V a set of vertices (or nodes) and E a set of edges (or links)
"},{"location":"#graph-traversal-bfs","title":"Graph traversal: BFS","text":"Traverse broad into the graph by visiting the sibling/neighbor before children nodes (one level of children at a time)
Iterative using a queue
Algorithm: similar to the tree version except we need to mark the visited nodes; we can start from any node
Queue<Node> queue = new LinkedList<>();\nNode first = graph.nodes.get(0);\nqueue.add(first);\nfirst.markVisitied();\n\nwhile (!queue.isEmpty()) {\n Node node = queue.poll();\n System.out.println(node.name);\n\n for (Edge edge : node.connections) {\n if (!edge.end.visited) {\n queue.add(edge.end);\n edge.end.markVisited();\n }\n }\n}\n
"},{"location":"#graph-traversal-dfs","title":"Graph traversal: DFS","text":"Traverse deep into the graph by visiting the children before sibling/neighbor nodes (traverse down one single path)
Walk through a path, backtrack until we find a new path
Algorithm: recursive or iterative using a stack (same algo as BFS except we use a stack instead of a queue)
"},{"location":"#how-to-compute-the-shortest-path-between-two-nodes-in-an-unweighted-graph","title":"How to compute the shortest path between two nodes in an unweighted graph","text":"BFS traversal by using an array to keep track of the min distance distances[i] gives the shortest distance between the input node and the node of id i
Algorithm: no need to keep track of the visited node, it is replaced by a test on the distance array
Queue<Node> queue = new LinkedList<>();\nqueue.add(parent);\nint[] distances = new int[graph.nodes.size()];\nArrays.fill(distances, -1);\ndistances[parent.id] = 0;\n\nwhile (!queue.isEmpty()) {\n Node node = queue.poll();\n for (Edge edge : node.connections) {\n if (distances[edge.end.id] == -1) {\n queue.add(edge.end);\n distances[edge.end.id] = distances[node.id] + 1;\n }\n }\n}\n
"},{"location":"#how-to-detect-a-cycle-in-a-directed-graph","title":"How to detect a cycle in a directed graph","text":"Using DFS by marking the visited nodes, there is a cycle if a visited node is also part of the current stack
The stack can be managed as a boolean array
boolean isCyclic(DirectedGraph g) {\n boolean[] visited = new boolean[g.size()];\n boolean[] stack = new boolean[g.size()];\n\n for (int i = 0; i < g.size(); i++) {\n if (isCyclic(g, i, visited, stack)) {\n return true;\n }\n }\n return false;\n}\n\nboolean isCyclic(DirectedGraph g, int node, boolean[] visited, boolean[] stack) {\n if (stack[node]) {\n return true;\n }\n\n if (visited[node]) {\n return false;\n }\n\n stack[node] = true;\n visited[node] = true;\n\n List<DirectedGraph.Edge> edges = g.getEdges(node);\n for (DirectedGraph.Edge edge : edges) {\n int destination = edge.destination;\n if (isCyclic(g, destination, visited, stack)) {\n return true;\n }\n }\n\n // Backtrack\n stack[node] = false;\n\n return false;\n}\n
"},{"location":"#how-to-detect-a-cycle-in-an-undirected-graph","title":"How to detect a cycle in an undirected graph","text":"Using DFS
Idea: for every visited vertex v, if there is an adjacent u such that u is already visited and u is not the parent of v, then there is a cycle
public boolean isCyclic(UndirectedGraph g) {\n boolean[] visited = new boolean[g.size()];\n for (int i = 0; i < g.size(); i++) {\n if (!visited[i]) {\n if (isCyclic(g, i, visited, -1)) {\n return true;\n }\n }\n }\n return false;\n}\n\nprivate boolean isCyclic(UndirectedGraph g, int v, boolean[] visited, int parent) {\n visited[v] = true;\n\n List<UndirectedGraph.Edge> edges = g.getEdges(v);\n for (UndirectedGraph.Edge edge : edges) {\n if (!visited[edge.destination]) {\n if (isCyclic(g, edge.destination, visited, v)) {\n return true;\n }\n } else if (edge.destination != parent) {\n return true;\n }\n }\n return false;\n}\n
"},{"location":"#how-to-name-a-graph-with-directed-edges-and-without-cycle","title":"How to name a graph with directed edges and without cycle","text":"Directed Acyclic Graph (DAG)
"},{"location":"#how-to-name-a-graph-with-few-edges-and-with-many-edges","title":"How to name a graph with few edges and with many edges","text":"Sparse: few edges
Dense: many edges
"},{"location":"#how-to-name-the-number-of-edges","title":"How to name the number of edges","text":"Degree of a vertex
"},{"location":"#how-to-represent-the-edges-of-a-graph-structure-and-complexity","title":"How to represent the edges of a graph (structure and complexity)","text":"Using an adjacency matrix: two-dimensional array of boolean with a[i][j] is true if there is an edge between node i and j
Time complexity: O(1)
Problem: - If graph is undirected: half of the space is useless - If graph is sparse, we still have to consume O(v\u00b2) space
Using an adjacency list: array (or map) of linked list with a[i] represents the edges for the node i
Time complexity: O(d) with d the degree of a vertex
Time and space: O(v + e)
"},{"location":"#topological-sort-technique","title":"Topological sort technique","text":"If there is an edge from U to V, then U <= V
Possible only if the graph is a DAG
Algo: - Create a graph representation (adjacency list) and an in-degree counter (Map) - Zero them for each vertex - Fill the adjacency list and the in-degree counter for each edge - Add to a queue each vertex whose in-degree count is 0 (source vertex with no parent) - While the queue is not empty, poll a vertex, then decrement the in-degree of its children and add any child whose in-degree reaches 0
To check if there is a cycle, we must compare the size of the produced array to the number of vertices
List<Integer> sort(int vertices, int[][] edges) {\n if (vertices == 0) {\n return Collections.EMPTY_LIST;\n }\n\n List<Integer> sorted = new ArrayList<>(vertices);\n // Adjacency list graph\n Map<Integer, List<Integer>> graph = new HashMap<>();\n // Count of incoming edges for each vertex\n Map<Integer, Integer> inDegree = new HashMap<>();\n\n for (int i = 0; i < vertices; i++) {\n inDegree.put(i, 0);\n graph.put(i, new LinkedList<>());\n }\n\n // Init graph and inDegree\n for (int[] edge : edges) {\n int parent = edge[0];\n int child = edge[1];\n\n graph.get(parent).add(child);\n inDegree.put(child, inDegree.get(child) + 1);\n }\n\n // Create a source queue and add each source (a vertex whose inDegree count is 0)\n Queue<Integer> sources = new LinkedList<>();\n for (Map.Entry<Integer, Integer> entry : inDegree.entrySet()) {\n if (entry.getValue() == 0) {\n sources.add(entry.getKey());\n }\n }\n\n while (!sources.isEmpty()) {\n int vertex = sources.poll();\n sorted.add(vertex);\n\n // For each vertex, we will decrease the inDegree count of its children\n List<Integer> children = graph.get(vertex);\n for (int child : children) {\n inDegree.put(child, inDegree.get(child) - 1);\n if (inDegree.get(child) == 0) {\n sources.add(child);\n }\n }\n }\n\n // Topological sort is not possible as the graph has a cycle\n if (sorted.size() != vertices) {\n return new ArrayList<>();\n }\n\n return sorted;\n}\n
"},{"location":"#travelling-salesman-problem","title":"Travelling salesman problem","text":"Find the shortest possible route that visits every city (vertex) exactly once
Possible solutions: - Greedy: nearest neighbour - Dynamic programming: compute optimal solution for a path of length n by using information already known for partial tours of length n-1 (time complexity: n^2 * 2^n)
"},{"location":"#two-types-of-graphs","title":"Two types of graphs","text":"Directed graph (with directed edges)
Undirected graph (with undirected edges)
"},{"location":"#greedy","title":"Greedy","text":""},{"location":"#best-first-search-algorithm_1","title":"Best-first search algorithm","text":"Greedy solution (non-complete) to find the shortest path to a target node
Algorithm: - Put initial state in a priority queue - While target not reached: poll an element and inserts all neighbours
Priority is computed using the evaluation function: f(n) = h where h is an heuristic (local cost to visit a node)
"},{"location":"#greedy-algorithm","title":"Greedy algorithm","text":"Algorithm paradigm of making the locally optimal choice at each stage using a heuristic function
A locally optimal choice does not necessarily mean having no global context for taking a decision
Never reconsider a choice (main difference with dynamic programming)
Solution found may not be the most optimal one
"},{"location":"#greedy-algorithm-structure","title":"Greedy algorithm: structure","text":"Often, the global context is spread into a priority queue
"},{"location":"#greedy-technique","title":"Greedy technique","text":"Identify an optimal subproblem or substructure in the problem and determine how to reach it
Focus on what you have now (don't think about what comes next)
We may want to apply the traversal technique to have a global context for the identification part (a map of letters/positions etc.)
"},{"location":"#technique-optimization-problems-requiring-a-min-or-max","title":"Technique - Optimization problems requiring a min or max","text":"Greedy technique
"},{"location":"#hash-table","title":"Hash Table","text":""},{"location":"#hash-table-complexity-search-insert-delete_1","title":"Hash table complexity: search, insert, delete","text":"All: amortized O(1), worst O(n)
"},{"location":"#hash-table-implementation","title":"Hash table implementation","text":"Resize the array when a threshold is reached
In case of an extremely nonuniform distribution, the buckets (linked lists) could be replaced by BSTs
"},{"location":"#heap","title":"Heap","text":""},{"location":"#binary-heap-min-heap-or-max-heap-complexity-insert-get-min-max-delete-min-max_1","title":"Binary heap (min-heap or max-heap) complexity: insert, get min (max), delete min (max)","text":"Insert: O(log (n))
Get min (max): O(1)
Delete min: O(log n)
"},{"location":"#binary-heap-min-heap-or-max-heap-data-structure-used-for-the-implementation","title":"Binary heap (min-heap or max-heap) data structure used for the implementation","text":"Using an array
For a node at index i: - Left child: 2 * i + 1 - Right child: 2 * i + 2 - Parent: (i - 1) / 2
"},{"location":"#binary-heap-min-heap-or-max-heap-definition","title":"Binary heap (min-heap or max-heap) definition","text":"A binary heap is a a complete binary tree with min-heap or max-heap property ordering. Also called min heap or max heap.
Min heap: each node smaller than its children, min value element at the root.
Two operations: insert(), getMin()
Difference with a BST: in a BST, each smaller element is on the left and each greater element on the right; in a heap, a smaller element can be found on either the left or the right side.
"},{"location":"#binary-heap-min-heap-or-max-heap-delete-min","title":"Binary heap (min-heap or max-heap) delete min","text":"Replace min element (root) with the last node (left-most, lowest-level node because a binary heap is a complete binary tree)
If violations, swap with the smallest child (level by level)
"},{"location":"#binary-heap-min-heap-or-max-heap-insert-algorithm","title":"Binary heap (min-heap or max-heap) insert algorithm","text":"Insert node at the end (left-most spot because a binary heap is a complete binary tree)
If violations, swap with parents until no more violation
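A minimal array-based min-heap sketch of the delete min and insert operations described above (helper names are illustrative; swap is the usual array swap):
// Insert value at the end then swap with parents until no more violation\nvoid insert(int[] heap, int size, int value) {\n heap[size] = value;\n int i = size;\n while (i > 0 && heap[i] < heap[(i - 1) / 2]) {\n swap(heap, i, (i - 1) / 2);\n i = (i - 1) / 2;\n }\n}\n\n// Replace the root with the last node then sift it down\nint deleteMin(int[] heap, int size) {\n int min = heap[0];\n heap[0] = heap[size - 1];\n siftDown(heap, size - 1, 0);\n return min;\n}\n\nvoid siftDown(int[] heap, int size, int i) {\n while (true) {\n int smallest = i;\n int left = 2 * i + 1, right = 2 * i + 2;\n if (left < size && heap[left] < heap[smallest]) {\n smallest = left;\n }\n if (right < size && heap[right] < heap[smallest]) {\n smallest = right;\n }\n if (smallest == i) {\n return;\n }\n // Swap with the smallest child, level by level\n swap(heap, i, smallest);\n i = smallest;\n }\n}\n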
"},{"location":"#binary-heap-min-heap-or-max-heap-use-cases","title":"Binary heap (min-heap or max-heap) use-cases","text":"Priority queue
"},{"location":"#comparator-implementation-to-order-two-integers_1","title":"Comparator implementation to order two integers","text":"Ordering, min-heap: (a, b) -> a - b
Reverse ordering, max-heap: (a, b) -> b - a
"},{"location":"#convert-an-array-into-a-binary-heap-in-place","title":"Convert an array into a binary heap in place","text":"For i from 0 to n-1, swap recursively element a[i] until min/max heap violation on its node
"},{"location":"#find-the-median-of-a-stream-of-numbers-2-methods-insertint-and-int-findmedian","title":"Find the median of a stream of numbers, 2 methods insert(int) and int findMedian()","text":"Solution: two heap technique
Keep two heaps and maintain the balance by transferring an element from one heap to another if not balanced
Return the median (difference if even or odd)
// First half\nPriorityQueue<Integer> maxHeap = new PriorityQueue<>((a, b) -> b - a);\n// Second half\nPriorityQueue<Integer> minHeap = new PriorityQueue<>();\n\npublic void insertNum(int n) {\n // First element\n if (minHeap.isEmpty()) {\n minHeap.add(n);\n return;\n }\n\n // Insert into min or max heap\n Integer minSecondHalf = minHeap.peek();\n if (n >= minSecondHalf) {\n minHeap.add(n);\n } else {\n maxHeap.add(n);\n }\n\n // Is balanced?\n if (minHeap.size() > maxHeap.size() + 1) {\n maxHeap.add(minHeap.poll());\n } else if (maxHeap.size() > minHeap.size() + 1) {\n minHeap.add(maxHeap.poll());\n }\n}\n\npublic double findMedian() {\n // Even\n if (minHeap.size() == maxHeap.size()) {\n return (double) (minHeap.peek() + maxHeap.peek()) / 2;\n }\n\n // Odd\n if (minHeap.size() > maxHeap.size()) {\n return minHeap.peek();\n }\n return maxHeap.peek();\n}\n
"},{"location":"#given-an-unsorted-array-of-numbers-find-the-k-largest-numbers-in-it","title":"Given an unsorted array of numbers, find the K largest numbers in it","text":"Solution: using a min heap but we keep only K elements in it
public static List<Integer> findKLargestNumbers(int[] nums, int k) {\n PriorityQueue<Integer> minHeap = new PriorityQueue<>();\n\n // Put the first K numbers\n for (int i = 0; i < k; i++) {\n minHeap.add(nums[i]);\n }\n\n // Iterate on the rest of the array\n // Check whether the current element is bigger than the smallest one\n for (int i = k; i < nums.length; i++) {\n if (nums[i] > minHeap.peek()) {\n minHeap.poll();\n minHeap.add(nums[i]);\n }\n }\n\n return toList(minHeap);\n}\n\npublic static List<Integer> toList(PriorityQueue<Integer> minHeap) {\n List<Integer> list = new ArrayList<>(minHeap.size());\n while (!minHeap.isEmpty()) {\n list.add(minHeap.poll());\n }\n\n return list;\n}\n
Space complexity: O(k)
"},{"location":"#heapsort-algorithm","title":"Heapsort algorithm","text":"Stable
"},{"location":"#time-complexity-to-build-a-binary-heap_1","title":"Time complexity to build a binary heap","text":"O(n)
"},{"location":"#two-heaps-technique","title":"Two heaps technique","text":"Keep two heaps: - A max heap for the first half - Then a min heap for the second half
We may need to balance them so that their sizes differ by at most 1
"},{"location":"#why-binary-heap-over-bst-for-priority-queue","title":"Why binary heap over BST for priority queue?","text":"BST needs an extra pointer to the min or max value (otherwise finding the min or max is O(log n))
Implemented using an array: faster in practice (better locality, more cache friendly)
Building a binary heap is O(n), instead of O(n log n) for a BST
"},{"location":"#linked-list","title":"Linked List","text":""},{"location":"#algorithm-to-reverse-a-linked-list","title":"Algorithm to reverse a linked list","text":"public ListNode reverse(ListNode head) {\n ListNode previous = null;\n ListNode current = head;\n\n while (current != null) {\n // Keep temporary next node\n ListNode next = current.next;\n // Change link\n current.next = previous;\n // Move previous and current\n previous = current;\n current = next;\n }\n\n return previous;\n}\n
"},{"location":"#doubly-linked-list","title":"Doubly linked list","text":"Each node contains a pointer to the previous and the next node
"},{"location":"#doubly-linked-list-complexity-access-insert-delete_1","title":"Doubly linked list complexity: access, insert, delete","text":"Access: O(n)
Insert: O(1)
Delete: O(1)
"},{"location":"#get-the-middle-of-a-linked-list","title":"Get the middle of a linked list","text":"Using the runner technique
"},{"location":"#iterate-over-two-linked-lists","title":"Iterate over two linked lists","text":"while (l1 != null || l2 != null) {\n\n}\n
"},{"location":"#linked-list-complexity-access-insert-delete_1","title":"Linked list complexity: access, insert, delete","text":"Access: O(n)
Insert: O(1)
Delete: O(1)
"},{"location":"#linked-list-questions-prerequisite","title":"Linked list questions prerequisite","text":"Single or doubly linked list?
"},{"location":"#queue-implementations-and-insertdelete-complexity","title":"Queue implementations and insert/delete complexity","text":"Insert: O(1)
Delete: O(1)
Insert: O(1)
Delete: O(1)
"},{"location":"#ring-buffer-or-circular-buffer-structure","title":"Ring buffer (or circular buffer) structure","text":"Data structure using a single, fixed-sized buffer as if it were connected end-to-end
"},{"location":"#what-if-we-need-to-iterate-backwards-on-a-singly-linked-list-in-constant-space-without-mutating-the-input","title":"What if we need to iterate backwards on a singly linked list in constant space without mutating the input?","text":"Reverse the liked list (or a subpart only), implement the algo then reverse it again to the initial state
"},{"location":"#math","title":"Math","text":""},{"location":"#a-a-property","title":"a = a property","text":"Reflexive
"},{"location":"#if-a-b-and-b-c-then-a-c-property","title":"If a = b and b = c then a = c property","text":"Transitive
"},{"location":"#if-a-b-then-b-a-property","title":"If a = b then b = a property","text":"Symmetric
"},{"location":"#logarithm-definition","title":"Logarithm definition","text":"Inverse function to exponentiation
If odd: middle value
If even: average of the two middle values (1, 2, 3, 4 => (2 + 3) / 2 = 2.5)
"},{"location":"#n-choose-k-problems","title":"n-choose-k problems","text":"From a set of n items, choose k items with 0 <= k <= n
P(n, k)
Order matters: n! / (n - k)! // How many permutations
Order does not matter: n! / ((n - k)! k!) // How many combinations
"},{"location":"#probability-pa-b-inter","title":"Probability: P(a \u2229 b) // inter","text":"P(a \u2229 b) = P(a) * P(b)
"},{"location":"#probability-pa-b-union","title":"Probability: P(a \u222a b) // union","text":"P(a \u222a b) = P(a) + P(b) - P(a \u2229 b)
"},{"location":"#probability-pba-probability-of-a-knowing-b","title":"Probability: Pb(a) // probability of a knowing b","text":"Pb(a) = P(a \u2229 b) / P(b)
"},{"location":"#queue","title":"Queue","text":""},{"location":"#dequeue-data-structure","title":"Dequeue data structure","text":"Double ended queue for which elements can be added or removed from either the front (head) or the back (tail)
"},{"location":"#queue_1","title":"Queue","text":"FIFO (First In First Out)
"},{"location":"#queue-implementations-and-insertdelete-complexity_1","title":"Queue implementations and insert/delete complexity","text":"Insert: O(1)
Delete: O(1)
Insert: O(1)
Delete: O(1)
"},{"location":"#recursion","title":"Recursion","text":""},{"location":"#how-to-handle-a-recursive-function-that-need-to-return-a-list","title":"How to handle a recursive function that need to return a list","text":"Input: - Result List - Current iteration element
Output: void
void f(List<String> result, String current) {\n // Do something\n result.add(...);\n}\n
"},{"location":"#how-to-handle-a-recursive-function-that-need-to-return-a-maximum-value","title":"How to handle a recursive function that need to return a maximum value","text":"Implementation: return max(f(a), f(b))
"},{"location":"#loop-inside-of-a-recursive-function","title":"Loop inside of a recursive function?","text":"Might be a code smell. The iteration is already brought by the recursion itself.
"},{"location":"#sort","title":"Sort","text":""},{"location":"#bubble-sort-algorithm","title":"Bubble sort algorithm","text":"Walk through a collection and compares 2 elements at a time
If they are out of order, swap them
Continue until the entire collection is sorted
"},{"location":"#bubble-sort-complexity-and-stability_1","title":"Bubble sort complexity and stability","text":"Time: O(n\u00b2)
Space: O(1)
Stable
"},{"location":"#counting-sort-complexity-stability-use-case_1","title":"Counting sort complexity, stability, use case","text":"Time complexity: O(n + k) // n is the number of elements, k is the range (the maximum element)
Space complexity: O(k)
Stable
Use case: known and small range of possible integers
"},{"location":"#counting-sort-algorithm","title":"Counting sort algorithm","text":"If range r is known
1) Create an array of size r where each a[i] represents the number of occurrences of i
2) Modify the array to store the cumulative sum (if a=[1, 3, 0, 2] => [1, 4, 4, 6])
3) Right shift the array with a backward iteration (element at index 0 is 0 => [0, 1, 4, 4]). Now a[i] represents the first index of i if the array was sorted
4) Create the sorted array by filling the elements from their first index
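A possible implementation of these four steps (assuming non-negative integers and a known range r):
int[] countingSort(int[] a, int r) {\n // 1) Count the occurrences of each value\n int[] count = new int[r + 1];\n for (int n : a) {\n count[n]++;\n }\n\n // 2) Cumulative sum\n for (int i = 1; i <= r; i++) {\n count[i] += count[i - 1];\n }\n\n // 3) Right shift with a backward iteration: count[i] is the first index of i\n for (int i = r; i > 0; i--) {\n count[i] = count[i - 1];\n }\n count[0] = 0;\n\n // 4) Fill the sorted array from the first indexes (stable)\n int[] sorted = new int[a.length];\n for (int n : a) {\n sorted[count[n]++] = n;\n }\n return sorted;\n}\n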
"},{"location":"#heapsort-algorithm_1","title":"Heapsort algorithm","text":"Time: Theta(n log n)
Space: O(1)
Unstable
Use case: space constrained environment with O(n log n) time guarantee
Yet, not stable and not cache friendly
"},{"location":"#insertion-sort-algorithm","title":"Insertion sort algorithm","text":"From i to 0..n, insert a[i] to its correct position to the left (0..i)
Used by humans
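A minimal sketch:
void insertionSort(int[] a) {\n for (int i = 1; i < a.length; i++) {\n int current = a[i];\n int j = i - 1;\n // Shift the bigger elements to the right\n while (j >= 0 && a[j] > current) {\n a[j + 1] = a[j];\n j--;\n }\n // Insert a[i] at its correct position on the left\n a[j + 1] = current;\n }\n}\n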
"},{"location":"#insertion-sort-complexity-stability-use-case_1","title":"Insertion sort complexity, stability, use case","text":"Time: O(n\u00b2)
Space: O(1)
Stable
Use case: partially sorted structure
"},{"location":"#mergesort-algorithm","title":"Mergesort algorithm","text":"Splits a collection into 2 halves, sort the 2 halves (recursive call) then merge them together to form one sorted collection
void mergeSort(int[] a) {\n int[] helper = new int[a.length];\n mergeSort(a, helper, 0, a.length - 1);\n}\n\nvoid mergeSort(int a[], int helper[], int lo, int hi) {\n if (lo < hi) {\n int mid = (lo + hi) / 2;\n\n mergeSort(a, helper, lo, mid);\n mergeSort(a, helper, mid + 1, hi);\n merge(a, helper, lo, mid, hi);\n }\n}\n\nprivate void merge(int[] a, int[] helper, int lo, int mid, int hi) {\n // Copy into helper\n for (int i = lo; i <= hi; i++) {\n helper[i] = a[i];\n }\n\n int p1 = lo; // Pointer on the first half\n int p2 = mid + 1; // Pointer on the second half\n int index = lo; // Index of a\n\n // Copy the smallest values from either the left or the right side back to the original array\n while (p1 <= mid && p2 <= hi) {\n if (helper[p1] <= helper[p2]) {\n a[index] = helper[p1];\n p1++;\n } else {\n a[index] = helper[p2];\n p2++;\n }\n index++;\n }\n\n // Copy the eventual rest of the left side of the array into the target array\n while (p1 <= mid) {\n a[index] = helper[p1];\n index++;\n p1++;\n }\n}\n
"},{"location":"#further-reading_2","title":"Further Reading","text":"Time: Theta(n log n)
Space: O(n)
Stable
Use case: good worst case time complexity and stable, good with linked list
"},{"location":"#quicksort-algorithm","title":"Quicksort algorithm","text":"Sort a collection by repeatedly choosing a pivot and partitioning the collection around it (smaller before, larger after)
Here the pivot will be the last element of the subarray
In an ideal world, the pivot would be the middle element so that we partition the array in two subsets of equal size
The worst case is when the pivot is always the smallest or biggest element of the subarray (e.g. leftmost or rightmost index of a sorted subarray)
void quickSort(int[] a) {\n quickSort(a, 0, a.length - 1);\n}\n\nvoid quickSort(int a[], int lo, int hi) {\n if (lo < hi) {\n int pivot = partition(a, lo, hi);\n quickSort(a, lo, pivot - 1);\n quickSort(a, pivot + 1, hi);\n }\n}\n\n// Returns an index so that all element before that index are smaller\n// And all element after are bigger\nint partition(int a[], int lo, int hi) {\n int pivot = a[hi];\n int pivotIndex = lo; // Will represent the pivot index\n\n // Iterate using the two pointers technique\n for (int i = lo; i < hi; i++) {\n // If the current index is smaller, swap and increment pivot index\n if (a[i] <= pivot) {\n swap(a, pivotIndex++, i);\n }\n }\n\n swap(a, pivotIndex, hi);\n return pivotIndex;\n}\n
"},{"location":"#quicksort-complexity-stability-use-case_1","title":"Quicksort complexity, stability, use case","text":"Time: best and average O(n log n), worst O(n\u00b2) if the array is already sorted in ascending or descending order
Space: O(log n) // In-place sorting algorithm
Not stable
Use case: in practice, quicksort is often faster than merge sort due to better locality (not applicable with linked list so in this case we prefer mergesort)
"},{"location":"#radix-sort-algorithm","title":"Radix sort algorithm","text":"Sort by applying counting sort on one digit at a time (least to most significant) Each new level must be stable (if equals, keep the order of the previous level)
Example: sort by units digit, then tens, then hundreds (each pass stable); see the sketch below
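A hedged sketch using a stable counting sort pass per digit (base 10, non-negative integers assumed):
void radixSort(int[] a) {\n int max = 0;\n for (int n : a) {\n max = Math.max(max, n);\n }\n // One stable counting sort pass per digit, least significant first\n for (int exp = 1; max / exp > 0; exp *= 10) {\n countingSortByDigit(a, exp);\n }\n}\n\nvoid countingSortByDigit(int[] a, int exp) {\n int[] count = new int[10];\n for (int n : a) {\n count[(n / exp) % 10]++;\n }\n // Cumulative sum: count[d] is the index after the last element with digit d\n for (int i = 1; i < 10; i++) {\n count[i] += count[i - 1];\n }\n int[] sorted = new int[a.length];\n // Backward iteration keeps the pass stable\n for (int i = a.length - 1; i >= 0; i--) {\n sorted[--count[(a[i] / exp) % 10]] = a[i];\n }\n System.arraycopy(sorted, 0, a, 0, a.length);\n}\n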
Time complexity: O(nk) // n is the number of elements, k is the maximum number of digits for a number
Space complexity: O(k)
Stable
Use case: if k < log(n) (for example 1M of elements from 0..1000 as 4 < log(1M))
"},{"location":"#selection-sort-algorithm","title":"Selection sort algorithm","text":"From i to 0..n, find repeatedly the min element then swap it with i
"},{"location":"#selection-sort-complexity_1","title":"Selection sort complexity","text":"Time: Theta(n\u00b2)
Space: O(1)
"},{"location":"#shuffling-an-array","title":"Shuffling an array","text":"Fisher-Yates shuffle algorithm: - Iterate over each element (i) - Pick a random index (from 0 to i included) and swap with the current element
"},{"location":"#stack","title":"Stack","text":""},{"location":"#stack_1","title":"Stack","text":"LIFO (Last In First Out)
"},{"location":"#stack-implementations-and-insertdelete-complexity_1","title":"Stack implementations and insert/delete complexity","text":"Insert: O(1)
Delete: O(1)
Insert: O(n), amortized time O(1)
Delete: O(1)
"},{"location":"#string","title":"String","text":""},{"location":"#first-check-to-test-if-two-strings-are-a-permutation-or-a-rotation-of-each-other","title":"First check to test if two strings are a permutation or a rotation of each other","text":"Same length
"},{"location":"#how-to-print-all-the-possible-permutations-of-a-string","title":"How to print all the possible permutations of a string","text":"Recursion with backtracking
void permute(String s) {\n permute(s, 0);\n}\n\nvoid permute(String s, int index) {\n if (index == s.length() - 1) {\n System.out.println(s);\n return;\n }\n\n for (int i = index; i < s.length(); i++) {\n s = swap(s, index, i);\n permute(s, index + 1);\n s = swap(s, index, i);\n }\n}\n
"},{"location":"#rabin-karp-substring-search","title":"Rabin-Karp substring search","text":"Searching a substring s in a string b takes O(s(b-s)) time
Trick: compute the hash of each substring s
Sliding window of size s
Time complexity: O(b)
If the hash matches, check if the strings are equal (as two different strings can have the same hash)
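A simplified rolling hash sketch (plain polynomial hash kept in a long; real implementations usually apply a modulus by a large prime to avoid overflow):
int rabinKarp(String text, String pattern) {\n int s = pattern.length(), b = text.length();\n if (s > b) {\n return -1;\n }\n\n final int BASE = 256;\n // BASE^(s-1), used to remove the leading character from the window hash\n long high = 1;\n for (int i = 0; i < s - 1; i++) {\n high *= BASE;\n }\n\n long patternHash = 0, windowHash = 0;\n for (int i = 0; i < s; i++) {\n patternHash = patternHash * BASE + pattern.charAt(i);\n windowHash = windowHash * BASE + text.charAt(i);\n }\n\n for (int i = 0; i <= b - s; i++) {\n // If the hash matches, check the strings are equal (collisions are possible)\n if (windowHash == patternHash && text.regionMatches(i, pattern, 0, s)) {\n return i;\n }\n // Slide the window of size s: remove the leading char, add the next one\n if (i < b - s) {\n windowHash = (windowHash - text.charAt(i) * high) * BASE + text.charAt(i + s);\n }\n }\n return -1;\n}\n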
"},{"location":"#string-permutation-vs-rotation","title":"String permutation vs rotation","text":"Permutation: contains the same characters in an order that can be different (abdc and dabc)
Rotation: rotates according to a pivot
"},{"location":"#string-questions-prerequisite","title":"String questions prerequisite","text":"Case sensitive?
Encoding?
"},{"location":"#technique","title":"Technique","text":"14 Patterns to Ace Any Coding Interview Question by Fahim ul Haq
"},{"location":"#01-knapsack-brute-force-technique","title":"0/1 Knapsack brute force technique","text":"Recursive approach: solve f(c, i) with c is the remaining capacity and i is th current item index At each level, we branch with the item at index i (if enough capacity) and without it
public int knapsack(int[] profits, int[] weights, int c) {\n return knapsack(profits, weights, c, 0, 0);\n}\n\npublic int knapsack(int[] profits, int[] weights, int c, int i, int sum) {\n if (i == profits.length || c <= 0) {\n return sum;\n }\n\n // Not\n int sum1 = knapsack(profits, weights, c, i + 1, sum);\n\n // With\n int sum2 = 0;\n if (weights[i] <= c) {\n sum2 = knapsack(profits, weights, c - weights[i], i + 1, sum + profits[i]);\n }\n\n return Math.max(sum1, sum2);\n}\n
"},{"location":"#01-knapsack-memoization-technique","title":"0/1 Knapsack memoization technique","text":"Memoization: store a[c][i] (c is the remaining capacity, i is the current item index)
As we need to store the 0 capacity, we have to init the array this way:
int[][] a = new int[c + 1][n] // n is the number of items
Time and space complexity: O(n * c)
public int knapsack(int[] profits, int[] weights, int capacity) {\n // Indexed from 0 to capacity included to store the 0 capacity\n Integer[][] a = new Integer[capacity + 1][profits.length];\n return knapsack(profits, weights, capacity, 0, a);\n}\n\npublic int knapsack(int[] profits, int[] weights, int capacity, int i, Integer[][] a) {\n if (i == profits.length || capacity == 0) {\n return 0;\n }\n\n // If value already exists, return\n if (a[capacity][i] != null) {\n return a[capacity][i];\n }\n\n // Without the current item\n int sum1 = knapsack(profits, weights, capacity, i + 1, a);\n // With the current item\n int sum2 = 0;\n if (weights[i] <= capacity) {\n sum2 = profits[i] + knapsack(profits, weights, capacity - weights[i], i + 1, a);\n }\n\n a[capacity][i] = Math.max(sum1, sum2);\n return a[capacity][i];\n}\n
"},{"location":"#01-knapsack-tabulation-technique","title":"0/1 Knapsack tabulation technique","text":"Two dimensional array: a[n + 1][c + 1] // n the number of items and c the max capacity
First row and first column are set to 0
a[row][col] represent the max profit with items 1..row at capacity col
remainingWeight = col - itemWeight // col: current max capacity
a[row][col] = max(a[row - 1][col], itemValue + a[row - 1][remainingWeight]) // max between item not selected and item selected + max remaining weight
If remainingWeight < 0, we can't chose the item so a[row][col] = a[row - 1][col]
Return last element of the array
public int solveKnapsack(int[] profits, int[] weights, int capacity) {\n int[][] a = new int[profits.length + 1][capacity + 1];\n\n for (int row = 1; row < profits.length + 1; row++) {\n int value = profits[row - 1];\n int weight = weights[row - 1];\n for (int col = 1; col < capacity + 1; col++) {\n int remainingWeight = col - weight;\n if (remainingWeight < 0) {\n a[row][col] = a[row - 1][col];\n } else {\n a[row][col] = Math.max(\n a[row - 1][col],\n value + a[row - 1][remainingWeight]\n );\n }\n }\n }\n\n return a[profits.length][capacity];\n}\n
If we need to compute a result like \"determine if a subset exists\" that return a boolean, the array type is boolean[][]
As we are only interested in the previous row, we can also use an int[2][c + 1] array
"},{"location":"#backtracking-technique","title":"Backtracking technique","text":"Solution for solving a problem recursively
Loop: - apply() // Apply a change - try() // Try a solution - reverse() // Reverse apply
"},{"location":"#cyclic-sort-technique","title":"Cyclic sort technique","text":"Iterate over each number of an array and swap it to its correct position
At the end, we may iterate on the array to check which number is not at its correct position
If numbers are not within the 1 to n range, we can simply drop them
Alternative: marker technique (mark a result by setting a[i] to negative for example)
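A minimal sketch for numbers within the 1 to n range (swap is the usual array swap):
void cyclicSort(int[] a) {\n int i = 0;\n while (i < a.length) {\n // Correct position of value a[i] is index a[i] - 1\n int correct = a[i] - 1;\n if (a[i] != a[correct]) {\n swap(a, i, correct);\n } else {\n i++;\n }\n }\n}\n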
"},{"location":"#greedy-technique_1","title":"Greedy technique","text":"Identify an optimal subproblem or substructure in the problem and determine how to reach it
Focus on what you have now (don't think about what comes next)
We may want to apply the traversal technique to have a global context for the identification part (a map of letters/positions etc.)
"},{"location":"#k-way-merge-technique","title":"K-way merge technique","text":"Given K sorted array, technique to perform a sorted traversal of all the elements of all arrays
We need to keep track of which structure the min element comes from (tracking the array index or taking the next node if it's a linked list)
"},{"location":"#runner-technique","title":"Runner technique","text":"Iterate over the linked list with two pointers simultaneously either with: - One ahead by a fixed amount - One faster
This technique can also be applied on other problems where we need to find a cycle (f(slow) and f(f(fast)) may converge)
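A sketch of cycle detection with two runners (same ListNode as above):
boolean hasCycle(ListNode head) {\n ListNode slow = head;\n ListNode fast = head;\n while (fast != null && fast.next != null) {\n slow = slow.next; // One step\n fast = fast.next.next; // Two steps\n // If there is a cycle, the two runners converge\n if (slow == fast) {\n return true;\n }\n }\n return false;\n}\n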
"},{"location":"#simplification-technique","title":"Simplification technique","text":"Simplify the problem. If solvable, generalize to the initial problem.
Example: sort the array first
"},{"location":"#sliding-window-technique","title":"Sliding window technique","text":"Range of elements in a specific window size
Two pointers left and right: - Move right while condition is valid - Move left if condition is not valid
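A hedged example: smallest subarray with a sum of at least target (assuming positive numbers; grow the window on the right, shrink it on the left):
int smallestSubarrayWithSum(int[] a, int target) {\n int windowSum = 0, left = 0, min = Integer.MAX_VALUE;\n for (int right = 0; right < a.length; right++) {\n // Grow the window on the right\n windowSum += a[right];\n // Shrink the window on the left while the condition is met\n while (windowSum >= target) {\n min = Math.min(min, right - left + 1);\n windowSum -= a[left++];\n }\n }\n return min == Integer.MAX_VALUE ? 0 : min;\n}\n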
"},{"location":"#subsets-technique","title":"Subsets technique","text":"Technique to find all the possible permutations or combinations
Start with an empty set, for each element of the input, add them to all the existing subsets to create new subsets
Example: - Given [1, 5, 3] - => [] // Start - => [], [1] - => [], [1], [5], [1,5] - => [], [1], [5], [1,5], [3], [1,3], [5,3], [1,5,3]
For each level, we iterate from 0 to size // size is the fixed size of the list
List<List<Integer>> findSubsets(int[] a) {\n List<List<Integer>> subsets = new ArrayList<>();\n // Add subset []\n subsets.add(new ArrayList<>());\n\n for (int n : a) {\n // Fix the current size\n int size = subsets.size();\n for (int i = 0; i < size; i++) {\n // Copy subset\n ArrayList<Integer> newSubset = new ArrayList<>(subsets.get(i));\n // Add element\n newSubset.add(n);\n subsets.add(newSubset);\n }\n }\n\n return subsets;\n}\n
"},{"location":"#technique-dealing-with-cycles-in-a-linked-list-or-an-array","title":"Technique - Dealing with cycles in a linked list or an array","text":"Runner technique
"},{"location":"#technique-find-all-the-permutations-or-combinations","title":"Technique - Find all the permutations or combinations","text":"Subsets technique or recursion + backtracking
"},{"location":"#technique-find-an-element-in-a-sorted-array-or-linked-list","title":"Technique - Find an element in a sorted array or linked list","text":"Binary search
"},{"location":"#technique-find-or-calculate-something-among-all-the-contiguous-subarrays-of-a-given-size","title":"Technique - Find or calculate something among all the contiguous subarrays of a given size","text":"Sliding window technique
Example: - Given an array, find the average of all subarrays of size \u2018K\u2019 in it
"},{"location":"#technique-find-the-longestshortest-substring-or-subarray","title":"Technique - Find the longest/shortest substring or subarray","text":"Sliding window technique
Example: - Longest substring with K distinct characters - Longest substring without repeating characters
"},{"location":"#technique-find-the-smallestlargestmedian-element-of-a-set","title":"Technique - Find the smallest/largest/median element of a set","text":"Two heaps technique
"},{"location":"#technique-finding-a-certain-element-in-a-linked-list-eg-middle","title":"Technique - Finding a certain element in a linked list (e.g. middle)","text":"Runner technique
"},{"location":"#technique-given-a-sorted-array-find-a-set-of-elements-that-fullfill-certain-conditions","title":"Technique - Given a sorted array, find a set of elements that fullfill certain conditions","text":"Two pointers technique
Example: - Given a sorted array and a target sum, find a pair in the array whose sum is equal to the given target - Given an array of unsorted numbers, find all unique triplets in it that add up to zero - Comparing strings containing backspaces
"},{"location":"#technique-given-an-array-of-size-n-containing-integer-from-1-to-n-eg-with-one-duplicate","title":"Technique - Given an array of size n containing integer from 1 to n (e.g. with one duplicate)","text":"Cyclic sort technique
"},{"location":"#technique-given-time-intervals","title":"Technique - Given time intervals","text":"Traversal technique
Iterate with two pointers, one over the starts, another one over the ends
Handle the element with the lowest value first and generate an event
Example: how many rooms for n meetings => meeting started, meeting started, meeting ended etc.
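A minimal sketch of the meeting rooms example, assuming meetings given as {start, end} pairs:
int minRooms(int[][] meetings) {\n int n = meetings.length;\n int[] starts = new int[n];\n int[] ends = new int[n];\n for (int i = 0; i < n; i++) {\n starts[i] = meetings[i][0];\n ends[i] = meetings[i][1];\n }\n Arrays.sort(starts);\n Arrays.sort(ends);\n\n int rooms = 0, max = 0, s = 0, e = 0;\n while (s < n) {\n if (starts[s] < ends[e]) {\n // Meeting started event\n rooms++;\n s++;\n } else {\n // Meeting ended event\n rooms--;\n e++;\n }\n max = Math.max(max, rooms);\n }\n return max;\n}\n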
"},{"location":"#technique-how-to-get-the-k-biggestsmallestfrequent-elements","title":"Technique - How to get the K biggest/smallest/frequent elements","text":"Top K elements technique
"},{"location":"#technique-optimization-problems-requiring-a-min-or-max_1","title":"Technique - Optimization problems requiring a min or max","text":"Greedy technique
"},{"location":"#technique-problems-featuring-a-list-of-sorted-arrays-merge-or-find-the-smallest-element","title":"Technique - Problems featuring a list of sorted arrays (merge or find the smallest element)","text":"K-way merge technique
"},{"location":"#technique-scheduling-problem-with-n-tasks-where-each-task-can-have-constraints-to-be-completed-before-others","title":"Technique - Scheduling problem with n tasks where each task can have constraints to be completed before others","text":"Topological sort technique
"},{"location":"#technique-situations-like-priority-queue-or-scheduling","title":"Technique - Situations like priority queue or scheduling","text":"Heap data structure
Possibly two heaps technique
"},{"location":"#top-k-elements-technique-biggest-and-smallest","title":"Top K elements technique (biggest and smallest)","text":"Finding the K biggest elements: - Min heap - Add k elements - Then iterate over the remaining elements, if current > min => remove min, add current
Finding the k smallest elements: - Max heap - Add k elements - Then iterate over the remaining elements, if current < max => remove max, add current
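A minimal sketch for the K biggest elements (the K smallest version is symmetric with a max heap), assuming a.length >= k:
List<Integer> findKBiggest(int[] a, int k) {\n PriorityQueue<Integer> minHeap = new PriorityQueue<>();\n // Add the first k elements\n for (int i = 0; i < k; i++) {\n minHeap.add(a[i]);\n }\n // For the remaining elements: if current > min => remove min, add current\n for (int i = k; i < a.length; i++) {\n if (a[i] > minHeap.peek()) {\n minHeap.poll();\n minHeap.add(a[i]);\n }\n }\n return new ArrayList<>(minHeap);\n}\n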
"},{"location":"#topological-sort-technique_1","title":"Topological sort technique","text":"If there is an edge from U to V, then U <= V
Possible only if the graph is a DAG
Algo: - Create a graph representation (adjacency list) and an in degree counter (Map) - Zero them for each vertex - Fill the adjacency list and the in degree counter for each edge - Add in a queue each vertex whose in degree count is 0 (source vertex with no parent) - While the queue is not empty, poll a vertex from it then decrement the in degree of its children (no removal)
To check if there is a cycle, we must compare the size of the produced array to the number of vertices
List<Integer> sort(int vertices, int[][] edges) {\n if (vertices == 0) {\n return Collections.EMPTY_LIST;\n }\n\n List<Integer> sorted = new ArrayList<>(vertices);\n // Adjacency list graph\n Map<Integer, List<Integer>> graph = new HashMap<>();\n // Count of incoming edges for each vertex\n Map<Integer, Integer> inDegree = new HashMap<>();\n\n for (int i = 0; i < vertices; i++) {\n inDegree.put(i, 0);\n graph.put(i, new LinkedList<>());\n }\n\n // Init graph and inDegree\n for (int[] edge : edges) {\n int parent = edge[0];\n int child = edge[1];\n\n graph.get(parent).add(child);\n inDegree.put(child, inDegree.get(child) + 1);\n }\n\n // Create a source queue and add each source (a vertex whose inDegree count is 0)\n Queue<Integer> sources = new LinkedList<>();\n for (Map.Entry<Integer, Integer> entry : inDegree.entrySet()) {\n if (entry.getValue() == 0) {\n sources.add(entry.getKey());\n }\n }\n\n while (!sources.isEmpty()) {\n int vertex = sources.poll();\n sorted.add(vertex);\n\n // For each vertex, we will decrease the inDegree count of its children\n List<Integer> children = graph.get(vertex);\n for (int child : children) {\n inDegree.put(child, inDegree.get(child) - 1);\n if (inDegree.get(child) == 0) {\n sources.add(child);\n }\n }\n }\n\n // Topological sort is not possible as the graph has a cycle\n if (sorted.size() != vertices) {\n return new ArrayList<>();\n }\n\n return sorted;\n}\n
"},{"location":"#traversal-technique","title":"Traversal technique","text":"Traverse the input and generate another data structure or optional events
Start the problem from this new state
"},{"location":"#two-heaps-technique_1","title":"Two heaps technique","text":"Keep two heaps: - A max heap for the first half - Then a min heap for the second half
May be required to balance them to have at most a difference in terms of size of 1
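A minimal sketch for a running median (MedianFinder is an illustrative name; median assumes at least one element was added):
class MedianFinder {\n private final PriorityQueue<Integer> maxHeap = new PriorityQueue<>(Collections.reverseOrder());\n private final PriorityQueue<Integer> minHeap = new PriorityQueue<>();\n\n void add(int n) {\n if (maxHeap.isEmpty() || n <= maxHeap.peek()) {\n maxHeap.add(n);\n } else {\n minHeap.add(n);\n }\n // Rebalance so the size difference is at most 1\n if (maxHeap.size() > minHeap.size() + 1) {\n minHeap.add(maxHeap.poll());\n } else if (minHeap.size() > maxHeap.size() + 1) {\n maxHeap.add(minHeap.poll());\n }\n }\n\n double median() {\n if (maxHeap.size() == minHeap.size()) {\n return (maxHeap.peek() + minHeap.peek()) / 2.0;\n }\n return maxHeap.size() > minHeap.size() ? maxHeap.peek() : minHeap.peek();\n }\n}\n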
"},{"location":"#two-pointers-technique","title":"Two pointers technique","text":"Two pointers iterating through the data structure in tandem until one or both pointers hit a certain condition
Often useful when structure is sorted. If not sorted, we may want to sort it first.
Most of the times (not always): first pointer is at the start, the second pointer is at the end
The two pointers can also be on two different data structures, still iterating in tandem (e.g., comparing strings containing backspaces)
Time complexity is linear
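A minimal sketch on the classic pair-with-target-sum example over a sorted array:
int[] pairWithTargetSum(int[] a, int target) {\n int left = 0, right = a.length - 1;\n while (left < right) {\n int sum = a[left] + a[right];\n if (sum == target) {\n return new int[]{left, right};\n }\n if (sum < target) {\n // We need a bigger sum\n left++;\n } else {\n // We need a smaller sum\n right--;\n }\n }\n return new int[]{-1, -1};\n}\n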
"},{"location":"#what-if-we-need-to-iterate-backwards-on-a-singly-linked-list-in-constant-space-without-mutating-the-input_1","title":"What if we need to iterate backwards on a singly linked list in constant space without mutating the input?","text":"Reverse the liked list (or a subpart only), implement the algo then reverse it again to the initial state
"},{"location":"#tree","title":"Tree","text":""},{"location":"#2-3-tree","title":"2-3 tree","text":"Self-balanced BST => O(log n) complexity
Either: - 2-node: contains a single value and has two children - 3-node: contains two values and has three children - Leaf: 1 or 2 keys
Insert: find the proper leaf and insert the value in-place. If the leaf now holds 3 values (called a temporary 4-node), split it into two 2-nodes and move the middle value up into the parent.
"},{"location":"#avl-tree","title":"AVL tree","text":"If tree is not balanced, rearange the nodes with single or double rotations
"},{"location":"#b-tree-complexity-access-insert-delete_1","title":"B-tree complexity: access, insert, delete","text":"All: O(log n)
"},{"location":"#b-tree-definition-and-use-case","title":"B-tree: definition and use case","text":"Self-balanced BST => O(log n) complexity
Can have more than two children (generalization of 2-3 tree)
Use case: huge amounts of data that cannot fit in main memory and must be stored on disk.
Height is kept low to reduce the disk accesses.
Matches how disk pages work
"},{"location":"#balanced-binary-tree-definition","title":"Balanced binary tree definition","text":"The balance factor of each node (the difference between the two subtree heights) should never exceed 1
Guarantee of O(log n) search
"},{"location":"#balanced-bst-use-case-b-tree-red-black-tree-avl-tree","title":"Balanced BST use case: B-tree, Red-black tree, AVL tree","text":"BFS: time O(v), space O(v)
DFS: time O(v), space O(h) (height of the tree)
"},{"location":"#binary-tree-bfs-traversal","title":"Binary tree BFS traversal","text":"Level order traversal (level by level)
Iterative algorithm: use a queue, put the root, iterate while queue is not empty
Queue<Node> queue = new LinkedList<>();\nqueue.add(root);\n\nwhile(!queue.isEmpty()) {\n Node node = queue.poll();\n visit(node);\n\n if(node.left != null) {\n queue.add(node.left);\n }\n if(node.right != null) {\n queue.add(node.right);\n }\n}\n
"},{"location":"#binary-tree-definition","title":"Binary tree definition","text":"Tree with each node having up to two children
"},{"location":"#binary-tree-dfs-traversal-in-order-pre-order-and-post-order","title":"Binary tree DFS traversal: in-order, pre-order and post-order","text":"It's depth first so:
"},{"location":"#binary-tree-complete","title":"Binary tree: complete","text":"Every level of the tree is fully filled, with last level filled from the left to the right
"},{"location":"#binary-tree-full","title":"Binary tree: full","text":"Each node has 0 or 2 children
"},{"location":"#binary-tree-perfect","title":"Binary tree: perfect","text":"2^l - 1 nodes with l the level: 1, 3, 7, etc. nodes
Every level is fully filled
"},{"location":"#bst-complexity-access-insert-delete","title":"BST complexity: access, insert, delete","text":"If not balanced O(n)
If balanced O(log n)
"},{"location":"#bst-definition","title":"BST definition","text":"Binary tree in which every node must fit the property: all left descendents <= n < all right descendents
Implementation: optional key, value, left, right
"},{"location":"#bst-delete-algo-and-complexity_1","title":"BST delete algo and complexity","text":"Find inorder successor and swap it
Average: O(log n)
Worst: O(h) if not self-balanced BST, otherwise O(log n)
"},{"location":"#bst-insert-algo","title":"BST insert algo","text":"Search for key or value (by recursively going left or right depending on the comparison) then insert a new node or reset the value (no swap)
Complexity: worst O(n)
public TreeNode insert(TreeNode root, int a) {\n if (root == null) {\n return new TreeNode(a);\n }\n\n if (a < root.val) { // Left\n root.left = insert(root.left, a);\n } else { // Right\n root.right = insert(root.right, a);\n }\n\n return root;\n}\n
"},{"location":"#bst-questions-prerequisite","title":"BST questions prerequisite","text":"Is it a self-balanced BST? (impacts: O(log n) time complexity guarantee)
"},{"location":"#complexity-to-create-a-trie_1","title":"Complexity to create a trie","text":"Time and space: O(n * l) with n the number of words and l the longest word length
"},{"location":"#complexity-to-insert-a-key-in-a-trie_1","title":"Complexity to insert a key in a trie","text":"Time: O(k) with k the size of the key
Space: O(1) iterative, O(k) recursive
"},{"location":"#complexity-to-search-for-a-key-in-a-trie_1","title":"Complexity to search for a key in a trie","text":"Time: O(k) with k the size of the key
Space: O(1) iterative or O(k) recursive
"},{"location":"#given-a-binary-tree-algorithm-to-populate-an-array-to-represent-its-level-by-level-traversal","title":"Given a binary tree, algorithm to populate an array to represent its level-by-level traversal","text":"Solution: BFS by popping only a fixed number of elements (queue.size)
public static List<List<Integer>> traverse(TreeNode root) {\n List<List<Integer>> result = new LinkedList<>();\n Queue<TreeNode> queue = new LinkedList<>();\n queue.add(root);\n while (!queue.isEmpty()) {\n List<Integer> level = new ArrayList<>();\n\n int levelSize = queue.size();\n // Pop only levelSize elements\n for (int i = 0; i < levelSize; i++) {\n TreeNode current = queue.poll();\n level.add(current.val);\n if (current.left != null) {\n queue.add(current.left);\n }\n if (current.right != null) {\n queue.add(current.right);\n }\n }\n result.add(level);\n }\n return result;\n}\n
"},{"location":"#how-to-calculate-the-path-number-of-a-node-while-traversing-using-dfs","title":"How to calculate the path number of a node while traversing using DFS?","text":"Example: 1 -> 7 -> 3 gives 173
Solution: sum = sum * 10 + n
private int dfs(TreeNode node, int sum) {\n if (node == null) {\n return 0;\n }\n\n sum = 10 * sum + node.val;\n\n // Leaf: the path number is complete\n if (node.left == null && node.right == null) {\n return sum;\n }\n\n // Sum the path numbers of both subtrees\n return dfs(node.left, sum) + dfs(node.right, sum);\n}\n
"},{"location":"#min-or-max-value-in-a-bst","title":"Min (or max) value in a BST","text":"Move recursively on the left (on the right)
"},{"location":"#red-black-tree","title":"Red-Black tree","text":"Self-balanced BST => O(log n) complexity
Binary Trees: Red Black by David Pynes
"},{"location":"#red-black-tree-complexity-access-insert-delete_1","title":"Red-black tree complexity: access, insert, delete","text":"All: O(log n)
"},{"location":"#reverse-a-binary-tree-algo","title":"Reverse a binary tree algo","text":"public void reverse(Node node) {\n if (node == null) {\n return;\n }\n\n Node temp = node.right;\n node.right = node.left;\n node.left = temp;\n\n reverse(node.left);\n reverse(node.right);\n}\n
"},{"location":"#trie-definition-implementation-and-use-case","title":"Trie definition, implementation and use case","text":"Tree-like data structure with empty root and where each node store characters
Each path down the tree represent a word (until a null node that represents the end of the word)
Usually implemented using a map of children (or a fixed size array with ASCII charset for example)
Use case: dictionary (save memory)
Also known as prefix tree
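A minimal sketch using a map of children and an end-of-word flag (TrieNode and isWord are illustrative names):
class TrieNode {\n Map<Character, TrieNode> children = new HashMap<>();\n boolean isWord;\n}\n\nclass Trie {\n private final TrieNode root = new TrieNode();\n\n void insert(String word) {\n TrieNode node = root;\n for (char c : word.toCharArray()) {\n node = node.children.computeIfAbsent(c, k -> new TrieNode());\n }\n // Mark the end of the word\n node.isWord = true;\n }\n\n boolean search(String word) {\n TrieNode node = root;\n for (char c : word.toCharArray()) {\n node = node.children.get(c);\n if (node == null) {\n return false;\n }\n }\n return node.isWord;\n }\n}\n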
"},{"location":"#why-to-use-bst-over-hash-table","title":"Why to use BST over hash table","text":"Sorted keys
"},{"location":"anki/","title":"Anki","text":"Anki is a free software (Windows/Mac/Linux/iPhone/Android) designed to help remembering information. Anki relies on the concept of spaced repetition which is a proven technique to increase the rate of memorization. Here's a 2-minute video that delves into spaced repetition:
Michael A. Nielsen, \"Augmenting Long-term Memory\"
The single biggest change that Anki brings about is that it means memory is no longer a haphazard event, to be left to chance. Rather, it guarantees I will remember something, with minimal effort. That is, Anki makes memory a choice.
I used Anki myself with Algo Deck and Design Deck and it paid off. This method played a key role in helping me land a role as L5 SWE at Google (senior software engineer).
Here is a flashcard example:
The Anki versions (a clone of the flashcards from this repo) are available via one-time GitHub sponsorships:
Trusted by over 100 developers.
"},{"location":"designdeck/","title":"Design Deck","text":"AnkiCheck the Anki version here.
"},{"location":"designdeck/#cache","title":"Cache","text":""},{"location":"designdeck/#cache-aside","title":"Cache aside","text":"Application is responsible for reading and writing to the DB (using write-through or write-back policy)
The cache doesn't interact with the storage directly
"},{"location":"designdeck/#cache-aside-vs-read-through","title":"Cache aside vs. read-through","text":"Cache aside: - Data model can be different from DB
Read-through: - Same data model as DB - Can use the refresh-ahead pattern
"},{"location":"designdeck/#cache-eviction-policy","title":"Cache eviction policy","text":"Cache to automatically refresh any recently accessed entry prior to its expiration
Used with read-through cache
Main difference: consistency
Write through: 1. Write to the cache and the DB in a single DB transaction (may still lead to cache inconsistency if the DB commit failed) 2. Return
Write back: 1. Write to the cache 2. Return 3. Asynchronously store in DB
"},{"location":"designdeck/#four-main-distributed-cache-benefits","title":"Four main distributed cache benefits","text":"Cache hit ratio: hits / total accesses
"},{"location":"designdeck/#read-through-cache","title":"Read-through cache","text":"Read-through cache sits in-line with the DB
Single entry point
"},{"location":"designdeck/#when-to-use-a-cache","title":"When to use a cache","text":"Content Delivery Network
Network of geographically dispersed servers used to deliver static content (images, CSS, Javascript files, etc.)
Two kinds of CDNs: - Push CDN: we are responsible for providing the content - Pull CDN: CDN is responsible for pulling the right content (expiration to be used)
Pull is easier to handle whereas push gives us more flexibility
Use case for pull: Docker Hub S3 layer
"},{"location":"designdeck/#db","title":"DB","text":""},{"location":"designdeck/#3-main-reasons-to-partition-data","title":"3 main reasons to partition data","text":"Atomic: all transaction succeeds or none does (all or nothing)
Consistency: from one valid state to another (invariants must always be true)
Not necessarily a property of the DB (e.g., foreign key constraint), can be a property of the application (e.g., credits and debits must be balanced)
Different from consistency in eventual consistency (which is more about convergence as the matter is replicating data)
Refers to serializability
Optimization to favor latency over consistency when writing to a DB (e.g., leaderless replication)
Background process that constantly looks for differences in data
Could be used as an alternative or in conjunction with read repair
"},{"location":"designdeck/#byzantine-fault-tolerant","title":"Byzantine fault-tolerant","text":"A system is Byzantine fault-tolerant if it continues to operate correctly if in the case of a Bizantine's problem (some of the nodes malfunctioning, not obeying the protocol or malicious attackers).
"},{"location":"designdeck/#calm-theorem","title":"CALM theorem","text":"Consistency As Logical Monotonicity
A program has a consistent, coordination-free (e.g., consensus-free) distributed implementation if and only if it is monotonic
Consistency in this context doesn't mean linearizability. It focuses on the consistency of the program's output while traditional consistency focuses on the consistency of reads and writes.
In CALM, a consistent program is one that produces the same output no matter in which order the inputs are processed and despite any conflicts.
Said differently: does the implementation produce the outcome we expect despite any race conditions that may arise?
"},{"location":"designdeck/#cap-theorem","title":"CAP theorem","text":"Consistency, availability, partition tolerance (e.g., one node cut off from the rest of the cluster because of a network partition) => pick 2 out of 3
C refers to linearizability
"},{"location":"designdeck/#caveat-of-serializability","title":"Caveat of serializability","text":"It's possible that serial order is different from the order in which transactions were actually run (latest may not win)
If not, we need a stricter isolation level: strict serializability (serializability + linearizability)
"},{"location":"designdeck/#chain-replication","title":"Chain replication","text":"Replication protocol that uses a different topology than leader based replication protocols like Raft
Left-most process referred to as the chain's head, right-most as the chain's tail: - Clients send writes to the head, which updates its local state and forwards to the next process in the chain - Next process updates its local state and forwards to the next process in the chain - Etc. - Once the update is received by the tail, the ack flows back to the head which replies to the client that the write succeeded
Fault tolerance is delegated to a dedicated component: control plane - If the head fails: the control plane removes it and makes the next process the new head - If an intermediate node fails: the control plane removes it temporarily from the chain, and then adds it back eventually as the tail - If the tail fails: the control plane removes it and makes its predecessor the new tail
Benefits: - Strongly consistent protocol - Reads are served from the tail without contacting other replicas first which allows a lower response time
Drawbacks: - Writes are slower than quorum-based replication. - A single slow node can slow down all writes. - As reads are served from a single node, it can't be scaled horizontally. A mitigation is to allow intermediate nodes to serve reads but they can do it only if a read is considered as clean (the ack for this object has been returned to the predecessor). // The tail serves as the authority of the latest clean version
Notes: - To avoid the overhead of having a single node handling the writes, we can find a way to shard data and handle multiple chains (see https://engineering.fb.com/2022/05/04/data-infrastructure/delta/)
"},{"location":"designdeck/#chain-replication-vs-consensus","title":"Chain replication vs. consensus","text":"Similar consistency guarantees
Chain replication: - Optimized for reads for CP systems - Better read availability: a chain of n nodes can tolerate up to n-2 nodes failure
Example with 5 nodes: - Chain replication: tolerate up to 3 nodes failure - Consensus with R=3 and W=3: tolerate up to 2 nodes failure
Consensus: - Optimized for writes for CP systems
"},{"location":"designdeck/#change-data-capture-cdc","title":"Change data capture (CDC)","text":"A datastore is selected as the authoritative source of data where all update operations are performed
An event log is then created from this datastore that is consumed by all the remaining operations the same way as in event sourcing
"},{"location":"designdeck/#concurrency-control","title":"Concurrency control","text":"Ensures that correct results for concurrent operations are generated
Pessimistic: lock (mutual exclusion)
Optimistic: checks for conflicts at the end of a transaction
In the end, concurrency control serves the same purpose as atomicity
"},{"location":"designdeck/#consensus","title":"Consensus","text":"Set of processes agreeing on some data value in a fault-tolerant way
Satisfies safety and liveness
"},{"location":"designdeck/#consistency-models","title":"Consistency models","text":"Describe what expectations clients might have in terms of possible returned values despite the existence of multiple copies of data and concurrent access to it
Not the C in ACID but the C in CAP (converging to an end state)
Eventual consistency: all the nodes converge to the same state (not necessarily the latest)
Writes follow reads: ensures that writes are ordered after writes that were observed by previous read operations
Example: - P1 reads value => foo - P1 updates value to bar => Every node will converge to bar (a process can't read bar, then foo, regardless of the process) Also known as session causality
Monotonic reads consistency: a client doing several reads in sequence will never go backward in time
Monotonic writes consistency: values originating from the same client appear in the order the client has executed them
Read-after-write consistency: if a client performs a write, then this write is visible during subsequent reads
Also known as read-your-writes consistency
Causal consistency: operations that are causally related need to be seen in the same order by all the nodes
Sequential consistency: operations appear to take place in some total order, and that order is consistent with the order of operations from each individual client
Twitter example: no guarantee between which tweet is seen first between two friends posting at the same time, but the ordering is guaranteed for the same friend
Even though there may be multiple replicas, the application does not need to worry about them
C in CAP
Real time guarantees
"},{"location":"designdeck/#cqrs","title":"CQRS","text":"Command Query Responsibility Segregation
Dissociate writes (command) from reads (query)
Pros: - Allows creating stores per use case (e.g., analytics, geospatial) - Scale the read and write parts independently
Cons: - Eventual consistency between the stores
"},{"location":"designdeck/#crdt","title":"CRDT","text":"Conflict-free Replicated Data Types
Data structure that is replicated across nodes: - Replicas are updated independently, concurrently and without coordination - An algo (part of the data type) can perform a deterministic conflict resolution - Replicas are guaranteed to eventually converge to the same state => Strong eventual consistency
Used in the context of collaborative applications
Note: CRDTs can be combined to form new CRDTs
"},{"location":"designdeck/#crdt-and-collaborative-applications-eg-google-docs","title":"CRDT and collaborative applications (e.g., Google Docs)","text":"Compared to OT, each character has a stable identifier (even if characters are added or deleted)
Example: 0 is the beginning of the document, 1 is the end of the document, every character has a fractional number as an ID
May lead to interleaving problems (e.g., two words inserted by two users are interleaved: \"Alice\", \"Bob\" => \"BoAlibce\")
Interleaving depends on the merging algorithm used (e.g., Treedoc doesn't lead to interleaving)
"},{"location":"designdeck/#db-indexes-tradeoff","title":"DB indexes tradeoff","text":"Speed up read query but slow down writes
"},{"location":"designdeck/#db-internal-components","title":"DB internal components","text":"Optimized state-based CRDTs where only recently applied changes to a state are replicated instead of the full state
"},{"location":"designdeck/#denormalization","title":"Denormalization","text":"Introduce some amount of duplication in a normalized dataset in order to speed up reads (e.g., denormalized document, cache or index)
Cons: - Requires more space - May slow down writes
"},{"location":"designdeck/#design-consideration-when-partitioning-data","title":"Design consideration when partitioning data","text":"Should match the primary access pattern
"},{"location":"designdeck/#downside-of-distributed-transactions","title":"Downside of distributed transactions","text":"Performance penalty
Example: distributed transactions in MySQL are reported to be over 10 times slower than single-node transactions
"},{"location":"designdeck/#event-sourcing","title":"Event sourcing","text":"Ensures that all changes to application state are stored as a sequence of events
"},{"location":"designdeck/#eventual-consistency-requirements","title":"Eventual consistency requirements","text":"Splits up DB by function
"},{"location":"designdeck/#fencing-token","title":"Fencing token","text":"Monotonically increasing token that increments whenever a client acquires a distributed lock
Use case: when writing to a DB, if the provided token has a lower value than the current one, rejects the write
Solves possible issues with leases, as an update has to be made with the latest token
"},{"location":"designdeck/#gossip-protocol","title":"Gossip protocol","text":"Peer-to-peer protocol based on the way epidemics spread
No central registry and the only way to spread common data is to rely on each member to pass it along to their neighbors
Useful when broadcasting to a large number of processes like thousands or more, where a deterministic protocol wouldn't scale
"},{"location":"designdeck/#graph-db-main-use-case","title":"Graph DB main use case","text":"Relational can handle simple cases of many-to-many relationships
Yet, if the connections become more complex, it's more natural to start modeling data as a graph
"},{"location":"designdeck/#hinted-handoff","title":"Hinted handoff","text":"Optimization to favor latency over consistency when writing to a DB
If a coordinator node cannot contact the necessary number of replicas, it stores the result of the operation locally and forwards it to the failed node(s) after they have recovered
Used in sloppy quorums
"},{"location":"designdeck/#hot-spot-in-partitioning","title":"Hot spot in partitioning","text":"Partition is heavily loaded compared to others
Also called skew
"},{"location":"designdeck/#in-a-database-strategy-to-handle-rebalancing","title":"In a database, strategy to handle rebalancing","text":"Not based on key hashing as a rebalancing would be huge
Simple solution: Create many more partitions than nodes and assign several partitions to each node (e.g., a db running on a cluster of 10 nodes may be split into 10k partitions). When a node is added to the cluster, it will steal a few partitions from every existing node
"},{"location":"designdeck/#isolation-levels","title":"Isolation levels","text":"Degree to which transactions are isolated from other concurrent execution transactions
Isolations come at a performance cost (more coordination and synchronization)
Dirty writes (a transaction overwrites a value previously written by another transaction that is still in flight) => Can violate integrity constraints
Dirty reads (a transaction observes a write from a transaction that hasn't committed yet) => Decisions can be taken based on data updates that can be rolled back
Fuzzy reads: a transaction reads a value twice but sees a different value in each read because a committed transaction updated the value between the two reads
Lost updates: two transactions read the same value and then try to update it to two different values; only one update survives
Example: Two transactions read the current inventory size (say 100 items), add respectively 5 and 10 items and then store back the size. Depending on the execution order, the final size can be 110 instead of 115.
Read skew: an integrity constraint seems to be violated because a transaction can only see partial results of another transaction
Write skew: two transactions read the same objects, then each updates some of those objects
Example: Two on-call doctors for a shift. Both feeling unwell, and decide to request leave. They both click the button at the same time. In the case of a write skew, the two transactions can succeed as for both, when reading the number of available doctors, it was more than one.
Example: Transaction A computes the max and average age of employees. Transaction B is interleaved and inserts a lot of old employees. Thus, the average age could be larger than the max.
"},{"location":"designdeck/#known-crdts","title":"Known CRDTs","text":"Counter: - Grow-only counter: increment only - Positive-negative counter: increment and decrement (combination of two grow only counter: one positive, one negative)
Register (a memory cell storing whatever): - LWW-register: total order using timestamps - Multi-value register: keep track of causality, in case of conflicts it returns all conflicting cases (analogy: Git with an interactive merge resolution)
Set: - Grow-only set: once an element is added it can't be removed - Two-phase set: elements can be added and removed (combination of two grow only set) - LWW-element set (last-write-wins): similar to two-phase set but we associate a timestamp for each element to resolve conflicts - Observed-remove set: use tags instead of timestamps; each element is associated to a list of add-tags and a list of remove-tags (example: vector clocks) - Sequence: used to build collaborative applications (e.g., Treedoc)
"},{"location":"designdeck/#last-write-wins-lww","title":"Last-write-wins (LWW)","text":"Conflict resolution based on timestamp
Used by DynamoDB or Cassandra to resolve conflicts
Shouldn't happen in single-master replication
"},{"location":"designdeck/#leader-election","title":"Leader election","text":"Algorithm to guarantee at most one leader at any given time (safety) and that an election eventually completes (liveness)
"},{"location":"designdeck/#lsm-tree","title":"LSM tree","text":"Log-Structured Merge tree
Consists of smaller mutable memory-resident (memtable) and larger immutable disk-resident (SSTable) components
Memtables data are sorted and flushed on disk when their size reaches a configurable threshold or periodically
Because a memtable is just a special case of a buffer, durability is not guaranteed (durability must be brought by replication)
Examples: Lucene, Cassandra, Bitcask, etc.
"},{"location":"designdeck/#lsm-tree-vs-b-tree","title":"LSM tree vs. B-tree","text":"LSM-tree faster for writes, slower for reads because it has to check multiple data structures (bigger read amplification): memtable and SSTable
Compaction can impact ongoing requests
B-tree faster for reads, slower for writes as it must write every piece of data at least twice in the WAL & tree itself (bigger write amplification)
Each key exists in exactly one place => easier to offer strong transactional semantics
"},{"location":"designdeck/#main-difference-between-consistency-models-and-isolation-levels","title":"Main difference between consistency models and isolation levels","text":"Consistency models: applies to single-object operations
Isolation levels: applies to multi-object operations
"},{"location":"designdeck/#merkle-tree","title":"Merkle tree","text":"A tree in which every leaf is labelled with the hash of a data block: - Level n contains the data blocks - Level n-1 the hash of one data block - Level n-2 the hash of 2 data blocks - Level 1 the hash of all the data blocks
Efficient and secure verification of the contents of a large data structure
Allows reducing the data transferred between a client and a server. For example, if we want to compare a Merkle tree stored on a server with one stored on the client, they can both exchange their top hash. If different, we can delve in and only get the data blocks which have changed.
"},{"location":"designdeck/#monotonic-reads-consistency-implementation","title":"Monotonic reads consistency implementation","text":"One way to achieve it is to make sure each user always makes their reads from the same replica
"},{"location":"designdeck/#mvcc","title":"MVCC","text":"Multiversion Concurrency Control
A possible implementation of optimistic concurrency control and snapshot isolation level
MVCC allows reads and writes to proceed with minimal coordination on the storage level since reads can continue accessing older values until the new ones are committed
"},{"location":"designdeck/#n1-select-problem","title":"N+1 select problem","text":"Assuming a one-to-many relationship between 2 tables A and B => A 1-* B
If we want to iterate through all the A and for each one, print the list of B, the naive implementation would be: - select * from A
- And then for each A, select * from B where A_ID = ?
Alternatively, we could reduce the number of round-trips to the DB from N+1 to 2 with a simple select * from B
Most ORM tools prevent N+1 selects
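A hypothetical sketch of the two approaches (queryA, queryB, A and B are illustrative names, not a real API):
// Naive: N+1 round-trips\nfor (A a : queryA(\"select * from A\")) {\n // One query per row of A\n List<B> bs = queryB(\"select * from B where A_ID = ?\", a.id());\n}\n\n// Better: 2 round-trips, the join is done in memory\nList<A> as = queryA(\"select * from A\");\nMap<Integer, List<B>> bsByAId = queryB(\"select * from B\").stream()\n .collect(Collectors.groupingBy(B::aId));\n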
"},{"location":"designdeck/#nosql-main-types-and-main-architecture-principles","title":"NoSQL: main types and main architecture principles","text":"Key-value store, document store, column-oriented store or graph DB
Commutative replicated data types
Replication is made in propagating the update operation
Operations characteristics: - Must be commutative. - Not necessarily idempotent. If idempotent, OK. If not, it's up to the delivery layer to ensure the operations are delivered without duplication. - Delivered in causal order.
"},{"location":"designdeck/#operational-transformation-ot-concept-and-main-drawback","title":"Operational transformation (OT): concept and main drawback","text":"A way to handle collaborative applications
Receive update operations and depending on the operations that occur concurrently, transform them
Example: - Initial state: \"helo\" - Concurrently: user 1 inserts \"l\" at position 3 and user 2 inserts \"!\" at position 4 - If the transaction for user 1 completes before the one of user 2, we end up with \"hell!o\" instead of \"hello!\" - OT will transform the transaction from user 2 into: insert \"!\" at position 5
Drawback: all the communications go through a central server (e.g., impossible with systems at scale such as Google Docs)
Replaced with CRDT
"},{"location":"designdeck/#optimistic-concurrency-control-pros-and-cons","title":"Optimistic concurrency control: pros and cons","text":"Perform badly if high contention as it leads to a high proportion of retry, thus making performance worse
If not much contention, it tends to perform better than pessimistic
"},{"location":"designdeck/#pacelc-theorem","title":"PACELC theorem","text":"If case of a network partition (P): we should choose between availability (A) or consistency (C)
Else, in the absence of partition (E): we should choose between latency (L) or consistency (C)
Most systems are either: - AP/EL - CP/EC
"},{"location":"designdeck/#partitioning-sharding","title":"Partitioning (sharding)","text":"Split up a large dataset that is too big for a single machine into smaller parts and spread them across several machines
Define the partition type based on the primary access pattern
"},{"location":"designdeck/#partitioning-criteria","title":"Partitioning criteria","text":"Range partitioning: keys are sorted and a partition owns all the keys from some minimum up to some maximum (example: MySQL RANGE COLUMNS partitioning) - Pros: efficient range queries - Cons: Risk of hot spots, requires repartitioning to potentially split a range into two subranges if a partition gets too big
Hash partitioning: hash function is applied to each key and a partition owns a range of hashes
"},{"location":"designdeck/#partitioning-methods","title":"Partitioning methods","text":"Horizontal partitioning: partition by rows
Vertical partitioning: partition by columns (create tables with fewer columns)
Rationale: if the subtables have different access patterns (e.g., a column is a blob that we rarely consume, we can create a vertical partitioning to store this blob not on the primary disk)
Also called normalization
"},{"location":"designdeck/#quorum","title":"Quorum","text":"Minimum number of nodes that need to vote on an operation before it can be considered successful
Usually: majority
"},{"location":"designdeck/#raft","title":"Raft","text":"Leader election and replication algorithms
"},{"location":"designdeck/#leader-election_1","title":"Leader election","text":"Using a state machine to elect a leader
Each process is in one of these three states: leader, candidate (part of the election process), follower
"},{"location":"designdeck/#replication","title":"Replication","text":"The leader stores the sequence of operations altering the state into a local ordered log
Then, this log is replicated across followers. Each entry is considered committed when it has been replicated on a majority of nodes
Replication enables consensus
"},{"location":"designdeck/#read-repair","title":"Read repair","text":"Optimization to favor latency over consistency when writing to a DB (e.g., leaderless replication)
If a coordinator node receives conflicting values from the contacted replicas (which shouldn't happen with single-master replication, for example), it handles them by: - Resolving the conflict (e.g., LWW) - Forwarding the resolved value to the stale replica - Responding to the read request
"},{"location":"designdeck/#relation-between-replication-factor-write-consistency-and-read-consistency","title":"Relation between replication factor, write consistency and read consistency","text":"Given: - N: number of replicas - W: number of nodes that have to ack a write for it to succeed - R: number of nodes that have to respond to a read operation for it to succeed
If R+W > N, the system can guarantee to return the most recent written value because there's always an overlap between read and write sets (consistency)
Notes: - In case of read-heavy systems, we want to minimize R - If W = 1 and R = N, durability isn't guaranteed in the presence of failure - If W < (N+1)/2, it may lead to write conflicts (e.g., W < 2 if 3 nodes) - If R+W <= N, weak/eventual consistency
"},{"location":"designdeck/#replication-vs-partition-impacts","title":"Replication vs. partition: impacts","text":"Replication: - Read-heavy - Availability > consistency
Partition: - Write-heavy (splitting up data across different shards)
"},{"location":"designdeck/#schema-on-read-vs-schema-on-write","title":"Schema-on-read vs. schema-on-write","text":"Schema-on-read: implicit schema but not enforced by the DB (also called schemaless but misleading)
Schema-on-write: explicit schema, the DB ensures all writes are conforming to it (e.g., relational DB)
"},{"location":"designdeck/#serializability","title":"Serializability","text":"I in ACID (strong isolation level)
Equivalent to serial execution (no interleaving due to concurrent transactions)
"},{"location":"designdeck/#serializable-snapshot-isolation-ssi","title":"Serializable Snapshot Isolation (SSI)","text":"Snapshot Isolation (SI) allows write skew
SSI is a stricter isolation level than SI preventing write skew: check at runtime for conflicts between transactions
Downside: increases the number of aborted transactions
"},{"location":"designdeck/#single-leader-multi-leader-leaderless-replication","title":"Single-leader, multi-leader, leaderless replication","text":""},{"location":"designdeck/#single-leader","title":"Single-leader","text":"All writes go through one leader
Pro: ensures consistency
Con: all writes go through a single node (bottleneck)
"},{"location":"designdeck/#multi-leader","title":"Multi-leader","text":"Rarely makes sense within a single datacenter (benefits rarely outweigh the added complexity) but used in multi-datacenter contexts
DB must resolve the conflicts in a convergent way
Use cases: - One leader per datacenter
Different topologies:
Most used: all-to-all
Pro: not limited to the write throughput of a single node
Con: possible write conflicts
"},{"location":"designdeck/#leaderless-replication","title":"Leaderless replication","text":"Client sends its writes to several replicas in parallel
Read requests are also sent in parallel to multiple replicas (this way, if a write hasn't been replicated yet to one replica, it won't lead to stale data)
Rely on read repair and anti-entropy mechanisms
Rely on quorum to know how long to wait for a request (not perfect: if a write fails because we didn't reach a quorum, what shall we do about the replicas where the write has already been committed)
Examples: Cassandra, DynamoDB, Riak
Pro: throughput
Con: quorums are not perfect, provide illusion of strong consistency when in reality, it's often not true
"},{"location":"designdeck/#sloppy-quorum","title":"Sloppy quorum","text":"In case of a quorum of w nodes to accept a write: if we can't reach w, the DB accepts the write replicate it to nodes that aren't among the ones on which the value usually lives
Relies on hinted handoff
"},{"location":"designdeck/#snapshot-isolation-si","title":"Snapshot Isolation (SI)","text":"Guarantee that all reads made in a transaction will see a consistent snapshot of the database
In practice, it reads the last committed values that existed at the time it started
Allows write skew
"},{"location":"designdeck/#snapshot-isolation-common-implementation","title":"Snapshot Isolation common implementation","text":"MVCC
"},{"location":"designdeck/#sstable","title":"SSTable","text":"Sorted String Table, immutable components of a LSM tree
Sorted immutable data structure
It consists of 2 components: index files and data files
The index (based on a hashtable or a B-tree) holds the keys and the data entries (offsets in the data file where the actual records are located)
Data files hold records in key order
"},{"location":"designdeck/#state-based-crdts-definition-and-requirements","title":"State-based CRDTs: definition and requirements","text":"Convergent replicated data types
Replication is made in propagating the full local state to replicas
States are merged with a function which must be: - Commutative - Idempotent - Associative => Updates monotonically increase the internal state according to the defined partial order rules (e.g., max of two values, union of two sets)
=> Delivery layer doesn't have to guarantee causal ordering nor idempotency, only eventual delivery
"},{"location":"designdeck/#strong-eventual-consistency-definition-and-requirements","title":"Strong eventual consistency: definition and requirements","text":"Stronger guarantee than eventual consistency
Based on the fact that we can define a deterministic outcome for any conflict
Requires: - Eventual delivery: every update applied to a replica is eventually applied to all replicas - Strong convergence: guarantees that replicas that have executed the same updates have the same state (with eventual consistency, the guarantee is that the replicas eventually reach the same state, once consensus is reached)
Strong convergence requires convergent replicated data types (part of CRDT family)
Main difference with eventual consistency: - Leaderless replication - No consensus needed, instead, it relies on a deterministic outcome for any conflict
A solution to the CAP theorem
"},{"location":"designdeck/#three-phase-commit-3pc","title":"Three-phase commit (3PC)","text":"Failure-resilient refinement of 2PC
Unlike 2PC, satisfies liveness but not safety
"},{"location":"designdeck/#transaction","title":"Transaction","text":"A unit of work performed in a database system, representing a change, which can be potentially composed of multiple operations
"},{"location":"designdeck/#two-main-approaches-to-partition-a-table-that-has-secondary-indexes","title":"Two main approaches to partition a table that has secondary indexes","text":"Partitioning secondary indexes by document: - Each partition maintains its own secondary index - Write: one partition - Query on the index: requires querying multiple partitions (scatter/gather)
Optimized for writes
Example: Elasticsearch, MongoDB, Cassandra, Riak, etc.
Partitioning secondary indexes by term: - Global index covering all the partitions (to be replicated) - Write: multiple partitions are updated (for resiliency) - Query on the index: served from one partition containing the index
Optimized for reads
"},{"location":"designdeck/#two-types-of-crdts","title":"Two types of CRDTs","text":"Operation-based and state-based
Operation-based require less bandwidth
State-based require fewer assumptions about the delivery layer
"},{"location":"designdeck/#two-phase-commit-2pc","title":"Two-phase commit (2PC)","text":"Protocol used to implement atomic transaction commits across multiple processes
Satisfies safety but not liveness
"},{"location":"designdeck/#wal","title":"WAL","text":"Write-ahead log (or redo log)
Append-only file to which every modification must be written
Used for restoration in the event of a DB crash: - Durability - Atomicity (allows identifying the operations in progress and completing or undoing them)
"},{"location":"designdeck/#when-relational-vs-when-document","title":"When relational vs. when document","text":"Relational (schema-on-write): - Better support for joins - Many-to-one and many-to-many relationships - ACID
Document (schema-on-read): - Schema flexibility - Better performance due to locality - Closer to the data structures used by the application - In general not ACID - In general write-heavy
"},{"location":"designdeck/#when-to-use-a-column-oriented-store","title":"When to use a column-oriented store","text":"Because columns are stored contiguously: analytical workloads (computing average values, finding trends, etc.)
Flexible schema
Limited space (storing same data type together offers a better compression ratio)
"},{"location":"designdeck/#why-db-schemaless-is-misleading","title":"Why DB schemaless is misleading","text":"There is an implicit schema but not enforced by the DB
More accurate term: schema-on-read
Different from relational DBs with schema-on-write where the schema is explicit and the DB ensures all written data conforms to it
Similar to dynamic vs. static type checking in a programming language
"},{"location":"designdeck/#why-is-in-memory-faster","title":"Why is in-memory faster","text":"Not necessarily because they don't need to read from disk (even a disk-based storage engine may never need to read from disk if enough memory)
Can be faster because they avoid the overhead of encoding in a form that can be written to disk
"},{"location":"designdeck/#write-and-read-amplification","title":"Write and read amplification","text":"Ratio of the amount of data written/read to the disk versus the amount of data intended to be written
"},{"location":"designdeck/#write-heavy-and-replication-type","title":"Write heavy and replication type","text":"Do not rely on single-master replication as it heavily impacts the scaling of write-heavy systems
Instead, rely on leaderless replication
Trade off: consistency is harder to guarantee
"},{"location":"designdeck/#design","title":"Design","text":""},{"location":"designdeck/#auditing","title":"Auditing","text":"Checking the integrity of data
"},{"location":"designdeck/#backward-vs-forward-compatibility","title":"Backward vs. forward compatibility","text":""},{"location":"designdeck/#bloom-filter","title":"Bloom filter","text":"Probabilistic, memory-efficient data structure for approximating the content of a set
Can tell if a key does not appear in the DB
"},{"location":"designdeck/#causality","title":"Causality","text":"Causal dependency: one event causing another
Happened-before relationship
"},{"location":"designdeck/#concurrent-operations","title":"Concurrent operations","text":"Not only operations that happen at the same time but also operations made without knowing about each other
Example: - Concurrent to-do list operations with a current \"Buy milk\" item - User 1 deletes it - User 2 doesn't have an internet connection, modifies it into \"Buy soy milk\", and then is connected again => this modification may have been done one hour after user 1 deletion
"},{"location":"designdeck/#consistent-hashing","title":"Consistent hashing","text":"Special kind of hashing such that when a resize occurs, only 1/n percent of the keys need to be rebalanced (n: number of nodes)
Solutions: - Ring consistent hash with virtual nodes to improve the distribution - Jump consistent hash: faster but nodes must be numbered sequentially (e.g., if we have 3 servers foo, bar, and baz => we can't decide to remove bar)
"},{"location":"designdeck/#design-impacts-of-sharing","title":"Design impacts of sharing","text":"May decrease: - Availability - Performance - Scalability
"},{"location":"designdeck/#design-read-heavy-vs-write-heavy-impacts","title":"Design: read-heavy vs. write-heavy impacts","text":"Read heavy: - Leverage replication - Leverage denormalization
Write heavy: - Leverage partition (usually) - Leverage normalization
"},{"location":"designdeck/#different-types-of-message-failure","title":"Different types of message failure","text":"Event log: - Consumers are free to select the point of the log they want to consume messages from, which is not necessarily the head - Log is immutable, messages cannot be removed by consumers (removed by a GC running periodically)
"},{"location":"designdeck/#exactly-once-delivery","title":"Exactly-once delivery","text":"Impossible to achieve
However, we can achieve exactly-once processing using a dedup or by requiring the consumers to be idempotent
"},{"location":"designdeck/#flp-impossibility","title":"FLP impossibility","text":"In an asynchronous distributed system, there's no consensus algorithm that can satisfy: - Agreement - Validity - Termination - And fault tolerance
"},{"location":"designdeck/#geohashing","title":"Geohashing","text":"Encode geographic coordinates into a short string called a cell with varying resolutions
The more letters in the string, the more precise the location
Main use case: - Proximity searches in O(1)
"},{"location":"designdeck/#hashing-definition-and-size-of-md5-and-sha256","title":"Hashing definition and size of MD5 and SHA256","text":"Map data of arbitrary size to fixed-size values
Examples: - MD5: 16 bytes - SHA256: 32 bytes
"},{"location":"designdeck/#hdfs","title":"HDFS","text":"Distributed filesystem: - Fault tolerant - Scalable - Optimised for batch operations
Architecture: - Single master (maintains filesystem metadata, informs clients about which server stores a specific part of a file) - Multiple data nodes
Leverage: - Partitioning: each file is partitioned into multiple chunks => performance - Replication => availability
Read: communicates with the master node to identify the servers containing the relevant chunks
Write: chain replication
"},{"location":"designdeck/#how-to-reduce-sharing","title":"How to reduce sharing","text":"Used to approximate cardinality of a set
Optimization for space over perfect accuracy
"},{"location":"designdeck/#backing-idea","title":"Backing idea","text":"Coin flip game: you flip a coin, if head, flip again, if tail stop
If a player reaches n flips, it means that on average, he tried 2^(n+1) times
"},{"location":"designdeck/#algo","title":"Algo","text":"For an ID, we will count how many consecutive 0 (head) bits on the left
Example: 001110 => 2
Hence, on average we should have seen 2^(2+1) = 8 visitors
Requirement: visitor IDs need to be uniformly distributed => either the IDs are randomly generated, or we hash them (if the IDs are auto-incremented, for example)
Required memory: log(log(m)) with m the number of unique visitors
Problem with this algo: it depends on luck. For example, if user 00000001 connects every day => the system will always approximate 2^8 visitors
"},{"location":"designdeck/#bucketing","title":"Bucketing","text":"Distribute to multiple counters and aggregate the results (possible because each counter is very small)
If we want 4 counters, we distribute the ID based on the first 2 bits
Result: 2^((n1 + n2 + n3 + n4) / 4)
Problem: mean is highly impacted with large outliers
Solution: use harmonic mean
"},{"location":"designdeck/#idempotent","title":"Idempotent","text":"If executed more than once it has the same effect as if it was executed once
"},{"location":"designdeck/#latency-numbers-every-programmer-should-know","title":"Latency numbers every programmer should know","text":"Lock with an expiry timeout after which the lock is automatically released
May lead to situations where two nodes believe they hold the lock (for example, when the expiry signal hasn't been caught yet by the first node because of a GC or CPU throttling)
Can be solved using a fencing token
"},{"location":"designdeck/#least-loaded-endpoint-load-balancing-strategy","title":"Least loaded endpoint load balancing strategy","text":"Not efficient
A more efficient option is to randomly pick two servers and route the request to the least-loaded one of the two
"},{"location":"designdeck/#liveness-property","title":"Liveness property","text":"Something good will eventually occur
Example: leader is elected, eventual consistency
"},{"location":"designdeck/#load-balancing","title":"Load balancing","text":"Route requests across a pool of servers
"},{"location":"designdeck/#load-shedding","title":"Load shedding","text":"Action to reduce the load on something
Example: when the CPU utilization reaches a threshold, the server can start returning errors
A special form of load shedding is selective client throttling, where an application assigns different quotas to each of its clients
"},{"location":"designdeck/#locality","title":"Locality","text":"Performance optimization to put several pieces of data in the same place
"},{"location":"designdeck/#log","title":"Log","text":"Append-only, totally ordered sequence of messages
Each message is: - Appended at the end of the log - Assigned a unique sequential index
Example: Kafka
"},{"location":"designdeck/#log-compaction","title":"Log compaction","text":"Throw away duplicate keys in the log and keep only the most recent update for each key
"},{"location":"designdeck/#main-drawback-of-shared-nothing-architectures","title":"Main drawback of shared-nothing architectures","text":"Reduce flexibility
If the application needs to access to new data access patterns in an efficient way, it might be hard to provide it given the system's data have been partitioned in a specific way
Example: attempting to query by a secondary attribute that is not the partitioning key might require to access all the nodes of the system
"},{"location":"designdeck/#mapreduce","title":"MapReduce","text":"Programming model for processing large amounts of data in bulk across many machines: - Map: processes a set of key/value pairs and produces as output another set of intermediate key/value pairs. - Reduce: receives all the values for each key and returns a single value, essentially merging all the values according to some logic
"},{"location":"designdeck/#microservices-pros-and-cons","title":"Microservices: pros and cons","text":"Pros: - Organizational (each team dictates its own release schedule, etc.) - Codebase is easier to digest - Strong boundaries - Independent scaling - Independent data model
Cons: - Eventual consistency - Remote calls - Harder to operate (more complex)
"},{"location":"designdeck/#number-of-values-to-generate-to-reach-50-chances-of-collision-32-bit-64-bit-and-128-bit-hash","title":"Number of values to generate to reach 50% chances of collision: 32-bit, 64-bit, and 128-bit hash","text":"Orchestration: single central system responsible for coordinating the execution
Choreography: no need for a central coordinator, each system is aware of the previous and the next
"},{"location":"designdeck/#outbox-pattern","title":"Outbox pattern","text":"Used to update a DB and publish an event in a transactional fashion
Within a transaction, persist in the DB (insert, update or delete) and insert at the same time a new row in an event table
Implements a worker that checks the event table, publishes an event and deletes the row (at least once guarantee)
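A minimal JDBC sketch of the transactional part (table and column names are illustrative assumptions):
void createOrder(Connection conn, String orderId, String payload) throws SQLException {\n conn.setAutoCommit(false);\n try (PreparedStatement order = conn.prepareStatement("INSERT INTO orders (id, payload) VALUES (?, ?)");\n PreparedStatement outbox = conn.prepareStatement("INSERT INTO outbox_events (order_id, type) VALUES (?, 'ORDER_CREATED')")) {\n order.setString(1, orderId);\n order.setString(2, payload);\n order.executeUpdate();\n outbox.setString(1, orderId);\n outbox.executeUpdate();\n // Both rows become visible atomically; a worker later publishes and deletes the event row\n conn.commit();\n } catch (SQLException e) {\n conn.rollback();\n throw e;\n }\n}\n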
"},{"location":"designdeck/#perfect-hashing","title":"Perfect hashing","text":"No collision, only possible if we know the keys up front
Given k elements, the hashing function returns an int between 0 and k - 1
"},{"location":"designdeck/#quadtree","title":"Quadtree","text":"Tree data structure where each internal node has exactly four children: NE, NW, SE, SW
Main use case: - Improve geospatial caching (e.g., 1km in an urban area isn't the same as 1km outside cities)
Source: https://engblog.yext.com/post/geolocation-caching
"},{"location":"designdeck/#rate-limiting-throttling-definition-and-algos","title":"Rate-limiting (throttling): definition and algos","text":"Mechanism that rejects a request when a specific quota is exceeded
"},{"location":"designdeck/#token-bucket-algo","title":"Token bucket algo","text":"Token of a pre-defined capacity, put back in the bucket periodically:
"},{"location":"designdeck/#leaking-bucket-algo","title":"Leaking bucket algo","text":"Uses a FIFO queue When a request arrives, checks if the queue is full: - If yes: request is dropped - If not: added to the queue => Requests pulled from the queue at regular intervals
"},{"location":"designdeck/#rebalancing","title":"Rebalancing","text":"Move data or services from one node to another in order to spread the load fairly
"},{"location":"designdeck/#rest","title":"REST","text":"Architectural style where the server exposes a set of resources
All communications must be stateless and cacheable
Relies mainly on HTTP but not mandatory
"},{"location":"designdeck/#rest-vs-grpc","title":"REST vs. gRPC","text":"REST (architectural style): - Universality - Standardization (status code, ETag, If-Match, etc.)
gRPC (RPC framework): - Contract - Binary protocol (faster, less bandwidth) // We could use HTTP/2 without gRPC and leverage binary protocols but it would require more effort - Bidirectional
"},{"location":"designdeck/#safety-property","title":"Safety property","text":"Something bad will never happen
Example: at most one leader can be elected at a time
"},{"location":"designdeck/#saga","title":"Saga","text":"Distributed transaction composed of a set of local transactions
Each transaction has a corresponding compensation action to undo its changes
Usually, a Saga is implemented with an orchestrator that manages the execution of the transactions and handles the compensations if needed
"},{"location":"designdeck/#scalability","title":"Scalability","text":"System's ability to cope with increased load
"},{"location":"designdeck/#scalability-ceiling","title":"Scalability ceiling","text":"Hard limit (e.g., device maximum throughput)
"},{"location":"designdeck/#shared-nothing-architectures","title":"Shared-nothing architectures","text":"Reduce coordination and contention so that every request can be processed independently by a single node or group of nodes
Increase availability, performance, and scalability
"},{"location":"designdeck/#source-of-truth","title":"Source of truth","text":"Holds the authoritative version of the data
"},{"location":"designdeck/#split-brain","title":"Split-brain","text":"Network partition => nodes unable to communicate with each other => multiple nodes believing they are the leader
As a node is unaware that another node is still functioning, it can lead to data corruption or data loss
"},{"location":"designdeck/#throughput","title":"Throughput","text":"The rate of work performed
"},{"location":"designdeck/#total-vs-partial-order","title":"Total vs. partial order","text":"Total order: a binary relation that can be used to compare any 2 elements of a set with each other
Partial order: a binary relation that can be used to compare only some of the elements of a set with each other
Total ordering in distributed systems is rarely mandatory
"},{"location":"designdeck/#uuid","title":"UUID","text":"128-bit number
Collision probability: after generating 1 billion UUIDs every second for ~100 years, the probability of creating a single duplicate reaches 50%
"},{"location":"designdeck/#validation-vs-verification","title":"Validation vs. verification","text":"Validation: process of analyzing the parts of the system and building mental models that reflects the interaction of those parts
Example: validate the quality of water by inspecting all the pipes and infrastructure to capture, clean and deliver water
Verification: process of analyzing output at a system boundary
Example: verify the quality of water by testing the water (output) coming from a sink
"},{"location":"designdeck/#vector-clock","title":"Vector clock","text":"Algorithm that generates partial ordering of events and detects causality violation
"},{"location":"designdeck/#why-asynchronous-communication","title":"Why asynchronous communication","text":"Reduce temporal coupling (not connected at the same time) => processes execute at independent rates, without blocking the sender
Also suited when the interaction pattern isn't request/response with the client blocking until it receives the response
"},{"location":"designdeck/#http","title":"HTTP","text":""},{"location":"designdeck/#301-vs-302","title":"301 vs. 302","text":"301: redirect permanently
302: redirect temporarily
"},{"location":"designdeck/#403-or-404","title":"403 or 404?","text":"Retuning 403 can leak existence of a resource
Example: Apple is secretly working on super cars and creates an internal GET https://apple.com/supercar
endpoint
Returning 403 means the user doesn't have the rights to access the resource, but leaks the existence of /supercar
Small files stored on a user's computer to hold specific data (e.g., language preference)
Requests made by the browser will contain cookies data
Types of cookies: - Session cookies: only lasts for the duration of a session - Persistent cookies: outlast user session - Third-party cookies: used for advertising
"},{"location":"designdeck/#four-main-http2-features","title":"Four main HTTP/2 features","text":"HTTP live streaming: video streaming protocol
"},{"location":"designdeck/#http_1","title":"HTTP","text":"Request/response protocol used to encode and transport information between a client and a server Stateless (each request is executed independently)
The request and the response are 2 standard message types exchanged in a single HTTP transaction - Request: method, URL, HTTP version, headers, body - Response: HTTP version, status, reason, headers, body
Example of a POST request:
```http
POST https://example.com HTTP/1.0
Host: example.com
User-Agent: Mozilla/4.0
Content-Length: 5

Hello
```
Application layer protocol (OSI level 7)
Relies on a transport protocol (OSI level 4, TCP most of the time but not mandatory) for error detection, flow control, reliability, etc.
"},{"location":"designdeck/#http-cache-control-header","title":"HTTP cache-control header","text":"Allows setting how long to cache a response
Part of the response header (hence, cached by the browser) but can be part of the request header too (hence, cached on server side)
If the response is marked as private, the result is intended for a single user (then won't be cached by a load balancer, for example)
"},{"location":"designdeck/#http-etag","title":"HTTP Etag","text":"Entity tag header that allows clients to make conditional requests
Server returns an ETag identifying a specific version of the resource (e.g., a hash of its content or its last modification timestamp)
Client sends an If-Match header to update a resource only if it has the most recent version
Maintain a persistent TCP connection (reduces the number of TCP and TLS handshakes)
"},{"location":"designdeck/#http-methods-safeness-and-idempotence","title":"HTTP methods: safeness and idempotence","text":"Doesn't have any visible side effects and can be cached
"},{"location":"designdeck/#http-status-code-429","title":"HTTP status code 429","text":"When clients are throttled, the most common way is to return a 429 (Too Many Requests)
The response can also include a Retry-After header indicating how long to wait before making a new request (in seconds)
"},{"location":"designdeck/#http-status-codes","title":"HTTP status codes","text":"Source: https://github.com/alex/what-happens-when
"},{"location":"designdeck/#kafka","title":"Kafka","text":""},{"location":"designdeck/#consumer-types","title":"Consumer types","text":"Without consumer group: each consumer will receive all the messages in a topic
With consumer group: each consumer will receive a subset of the messages
Each consumer is assigned to multiple partitions (zero to many)
A partition is always assigned to only one consumer
If there are more consumers than partitions, some consumers will not be assigned to any partition (scalability ceiling)
"},{"location":"designdeck/#durabilityavailability-and-latencythroughput-tradeoffs","title":"Durability/availability and latency/throughput tradeoffs","text":"Source: https://developers.redhat.com/articles/2022/05/03/fine-tune-kafka-performance-kafka-optimization-theorem#kafka_priorities_and_the_cap_theorem
"},{"location":"designdeck/#log-compaction_1","title":"Log compaction","text":"Log compaction is a mechanism to give per-record retention to a topic
It ensures that Kafka will always retain at least the last message for each key of a given partition
A partition that is not yet compacted may have more than one message with the same key
Property: - retention.ms: maximum time the topic will retain old log segments before deleting or compacting them (default: 7 days)
For low-throughput topics (topics whose segments are rolled because of segment.ms rather than segment.bytes), we should ensure that segment.ms is lower than retention.ms
A strictly increasing identifier per partition
"},{"location":"designdeck/#partition","title":"Partition","text":"Topics are divided into partitions
A partition is an ordered, immutable log of messages
No guaranteed ordering per topic with multiple partitions
Yet, the ordering is guaranteed per partition
"},{"location":"designdeck/#partition-distribution","title":"Partition distribution","text":"The client implements a partitioner based on the key (e.g., hash(key) % number of partitions)
This is not done on Kafka's side
If key is empty: round-robin
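A simplified sketch of the key-based partitioner described above (Kafka's default actually hashes the key bytes with murmur2; String.hashCode is used here only for illustration):
int partitionFor(String key, int numPartitions) {\n // Mask the sign bit so the modulo is never negative\n return (key.hashCode() & 0x7fffffff) % numPartitions;\n}\n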
"},{"location":"designdeck/#rebalancing_1","title":"Rebalancing","text":"Not possible to decrease the number of partitions: topic has to be recreated
Possible to increase the number of partitions
Possible issue: no more guaranteed ordering as one key may be assigned to a different partition
"},{"location":"designdeck/#segment","title":"Segment","text":"Each partition is divided into segments
Instead of storing all the messages of a partition in a single file, Kafka splits them into chunks called segments. A log segment is a file identified by the first message offset it contains
Properties: - segment.bytes: maximum segment file size before creating a new segment (default: 1GB) - segment.ms: period after which a new segment is created, even if the segment is not full (default: 7 days)
Distribute messages
All the consumers from one consumer group receive a portion of the messages
One partition is assigned to one consumer, one consumer can listen to multiple partitions
"},{"location":"designdeck/#math","title":"Math","text":""},{"location":"designdeck/#associative-property","title":"Associative property","text":"A binary operation is associative if rearranging the parentheses in an expression will not change the result
Example: + is associative; e.g., (2 + 3) + 4 = 2 + (3 + 4)
A binary operation is commutative if changing the order of the operands doesn't change the result
Example: + is commutative, / isn't commutative
x1: probability of p1 (e.g. 0.5)
Less sensitive to large outliers
"},{"location":"designdeck/#network","title":"Network","text":""},{"location":"designdeck/#arp-protocol","title":"ARP protocol","text":"Map an IP address to a MAC address
"},{"location":"designdeck/#average-connection-speed-in-usa","title":"Average connection speed in USA","text":"42 Mbps
"},{"location":"designdeck/#backpressure","title":"Backpressure","text":"A node limits its own rate of sending in order to avoid overloading. Queueing is done on the sender side.
Also known as flow control
Example: TCP flow control
"},{"location":"designdeck/#bandwidth","title":"Bandwidth","text":"Maximum amount of data that can be transferred in a unit of time
"},{"location":"designdeck/#bgp","title":"BGP","text":"Border Gateway Protocol: Routing system of the internet
When a client submits data via the Internet, BGP is responsible for looking at all of the available paths that data could travel and picking the best route
Note: The chosen route isn't necessarily the fastest one, it can be the cheapest one. See https://technology.riotgames.com/news/fixing-internet-real-time-applications-part-i.
"},{"location":"designdeck/#cors","title":"CORS","text":"Cross-origin resource sharing
Mechanism to allow restricted resources on a page to be requested from another domain outside the domain from which the resource was served
It extends and adds flexibility to SOP (Same-Origin Policy, same domain)
Example: User visits A and the page attempts to fetch data from B: 1. Browser sends a GET request to B with Origin header A 2. Server may respond with: - Access-Control-Allow-Origin (ACAO) header set to the domain A - ACAO set to a wildcard (*) indicating that the requests from all domains are allowed - An error if the server does not allow a cross-origin request
"},{"location":"designdeck/#difference-ping-heartbeat","title":"Difference ping & heartbeat","text":"Ping: sends messages to a process and expects a response within a specified time period (request-reply)
Heartbeat: a process is actively notifying its peers that it's still running by sending a message (notification)
"},{"location":"designdeck/#difference-tcp-udp","title":"Difference TCP & UDP","text":"A view is just an abstraction (SQL request is rewritten to match the actual schema)
A materialized view is a copy (written to disk)
"},{"location":"designdeck/#dns","title":"DNS","text":"Domain Name System: automatic translation between a name and an IP address
Notes: - Usually the local DNS configuration is the ISP one (config initialized from the router or static config) - The browser, the OS and the DNS resolver all use caches internally - A TTL is used to inform the cache how long the entry is valid
"},{"location":"designdeck/#dns-lookup-push-or-pull","title":"DNS lookup: push or pull","text":"DNS is based on the pull mode: - If record is present: DNS will return it - If record isn't present: DNS will pull the value, store it, and then return it
Notes: - New DNS records are visible immediately (nothing is cached yet) - DNS updates are slow because of TTLs (there is no propagation; we wait for cached records to expire)
"},{"location":"designdeck/#health-checks-passive-vs-active","title":"Health checks: passive vs. active","text":"Passive: performed by the load balancer as it routes incoming requests (e.g., 503)
Active: the load balancer actively checking the health of the servers via a query to their health endpoint
"},{"location":"designdeck/#internet-model","title":"Internet model","text":"A network of networks
"},{"location":"designdeck/#layer-4-vs-layer-7-load-balancer","title":"Layer 4 vs. layer 7 load balancer","text":"Layer 4 is faster and requires less computing resources than layer 7 is but less flexible
Layer 4: look at the info at the transport layer to distribute the requests (source, destination, port)
Forward packet using NAT
Layer 7: look at the info at the application layer to distribute the requests (header, message, etc.)
Terminates the network traffic, reads the request, then opens a connection to the target server
A layer 7 can de-multiplex individual HTTP requests where multiple concurrent streams are multiplexed on the same TCP connection
"},{"location":"designdeck/#mac-address","title":"MAC address","text":"A unique identifier assigned to a network interface
"},{"location":"designdeck/#max-size-of-a-tcp-packet","title":"Max size of a TCP packet","text":"64K
"},{"location":"designdeck/#mqtt-lwt","title":"MQTT LWT","text":"Last Will and Testament
Whenever a client disconnects ungracefully (e.g., heartbeat failure), the broker publishes the client's pre-registered will message to a particular topic
"},{"location":"designdeck/#ntp","title":"NTP","text":"Network Time Protocol: used to synchronize clocks
"},{"location":"designdeck/#osi-model","title":"OSI model","text":"7 layers: 1. Physical: transmission of raw bits over a physical link (e.g., USB, Bluetooth) 2. Data link: responsible from moving a packet of data from one node to a neighbouring node 3. Network: provides a way of sending packets between nodes that are not directly linked and might belong to other networks (e.g., IP, iptables routing) 4. Transport: application to application communication, based on ports when multiple applications on the same node wants to communicate (e.g., TCP, UDP) 5. Session 6. Presentation 7. Application: protocol of exchanges between the two sides (e.g., DNS, HTTP)
"},{"location":"designdeck/#routers","title":"Routers","text":"A way to connect networks that are connected with each other (used for the Internet)
Capable of routing packets properly across networks so that they reach their destination successfully
Based on the fact that an IP has a network prefix
"},{"location":"designdeck/#routers-buffering","title":"Routers buffering","text":"Routers use queuing (buffering) to address network congestion
A buffer has a fixed size and a fixed number of packets
If no available buffer: packet is dropped
Note: not a way to increase the throughput
"},{"location":"designdeck/#routers-processing","title":"Routers processing","text":"Per-packet processing, no buffering
Impacts: - It's faster to route 10 packets of 1000 bytes than 20 packets of 500 bytes - Sending small packets more frequently can fill the router buffer more quickly
Source: https://technology.riotgames.com/news/fixing-internet-real-time-applications-part-i
"},{"location":"designdeck/#routing-table","title":"Routing table","text":"Example:
Destination    Network mask     Gateway      Interface
0.0.0.0        0.0.0.0          240.1.1.3    if1
240.1.1.0      255.255.255.0    0.0.0.0      if1
"},{"location":"designdeck/#service-mesh","title":"Service mesh","text":"All network traffic from a client goes through a process co-located on the same machine (sidecar)
Used to facilitate service-to-service communications
"},{"location":"designdeck/#switch","title":"Switch","text":"Receive frame and forward to specific links they are addressed to. Used for local networks.
Example: Ethernet frame
To do this, the switch maintains a switch table that maps MAC addresses to the corresponding interfaces that lead to them
At first, the switch table is empty; if an entry is missing, the frame is forwarded to all the interfaces (switches are self-learning)
"},{"location":"designdeck/#tcp-congestion-control","title":"TCP congestion control","text":"Determine dynamically the throughput (the number of segments that can be sent without an ack): - Increase exponentially for every segment ack - Decrease with a missed ack
Upon a new connection, the size of the window is set to a system default
It's one of the reasons why reusing a TCP connection leads to a performance increase
"},{"location":"designdeck/#tcp-connection-backlog","title":"TCP connection backlog","text":"SYN requests are queued before being accepted by a user-mode process
When there are too many requests for the process, the backlog reaches a limit and SYN packets are dropped (to be later retransmitted by the client)
"},{"location":"designdeck/#tcp-flow-control","title":"TCP flow control","text":"A receiver communicates back to the sender the size of the buffer when acknowledging a segment
Backpressure mechanism
"},{"location":"designdeck/#tcp-handshake","title":"TCP handshake","text":"3-way handshake - syn (sender to receiver) - syn-ack (receiver to sender) // ack the segment number received - ack (sender to receiver) // ack the segment number received
"},{"location":"designdeck/#websocket","title":"Websocket","text":"Communication protocol (layer 7) provides a full-duplex communication channel over a single TCP connection and bidirectional streaming capabilities
Different from HTTP but compatible with HTTP (starts as an HTTP connection, which is then upgraded via a well-defined handshake)
Obsolete with HTTP/2
"},{"location":"designdeck/#why-cant-we-rely-on-the-system-clock-in-distributed-systems","title":"Why can't we rely on the system clock in distributed systems?","text":"Provides guaranteed fault isolation by design
Based on the idea of partitioning a shared resource to isolate failures
"},{"location":"designdeck/#cascading-failure","title":"Cascading failure","text":"A process in a system of interconnected parts in which the failure of one or few parts can trigger the failure of other parts and so on
"},{"location":"designdeck/#causal-consistency-implementation","title":"Causal consistency implementation","text":"When a replica receives a new write, it doesn't apply it locally immediately. First, it checks whether the write's dependencies have been committed locally. If not, it waits until the required version appears.
"},{"location":"designdeck/#circuit-breaker","title":"Circuit breaker","text":"Used to prevent a network or service failure from cascading to other failures
Implemented on the client-side
Three states: - Closed: accept requests - Open: do not accept requests and fail immediately - Half-open: give the service another chance (can also be implemented using a probe)
The circuit can be opened when the health endpoint of the service is down or when the number of consecutive errors reaches a threshold
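A minimal sketch of the three-state machine (threshold and timeout values are illustrative):
class CircuitBreaker {\n enum State { CLOSED, OPEN, HALF_OPEN }\n\n private State state = State.CLOSED;\n private int consecutiveErrors = 0;\n private long openedAt;\n private static final int ERROR_THRESHOLD = 5;\n private static final long RETRY_AFTER_MS = 30_000;\n\n synchronized boolean allowRequest() {\n if (state == State.OPEN && System.currentTimeMillis() - openedAt > RETRY_AFTER_MS) {\n state = State.HALF_OPEN; // Give the service another chance\n }\n return state != State.OPEN; // Open: fail immediately\n }\n\n synchronized void onSuccess() {\n consecutiveErrors = 0;\n state = State.CLOSED;\n }\n\n synchronized void onError() {\n consecutiveErrors++;\n if (state == State.HALF_OPEN || consecutiveErrors >= ERROR_THRESHOLD) {\n state = State.OPEN;\n openedAt = System.currentTimeMillis();\n }\n }\n}\n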
"},{"location":"designdeck/#exponential-backoff","title":"Exponential backoff","text":"Wait time increased exponentially after every retry attempt
"},{"location":"designdeck/#fault-tolerance","title":"Fault tolerance","text":"Property of a system that can continue operating correctly in the presence of failure of its components
"},{"location":"designdeck/#jitter","title":"Jitter","text":"Introduces a part of randomness to avoid synchronized retry spikes experienced during cascading failures
"},{"location":"designdeck/#knee-point","title":"Knee point","text":"Moment when linear scalability is not possible anymore
"},{"location":"designdeck/#phi-accrual-failure-detector","title":"Phi-accrual failure detector","text":"Instead of treating failure node failure as a binary problem (up or down), a phi-accrual failure detector has a continuous scale, capturing the probability of the monitored process's crash
Works by maintaining a sliding window, collecting arrival times of the most recent heartbeats
Used to approximate the arrival time of the next heartbeat and compute a suspicion level (how certain the failure detector is about a failure)
"},{"location":"designdeck/#retry-amplification","title":"Retry amplification","text":"Having retries at multiple levels of the dependency chain can amplify the number of retry
The deeper a service in the chain, the higher the load it will be exposed to due to amplification:
In case of a long dependency chain, perhaps we should only retry at a single level of the chain
"},{"location":"designdeck/#security","title":"Security","text":""},{"location":"designdeck/#authentication","title":"Authentication","text":"Process of determining whether someone or something is who or what it declares itself to be
"},{"location":"designdeck/#certificate-authorities","title":"Certificate authorities","text":"Organizations issuing certificates by signing them
"},{"location":"designdeck/#cipher","title":"Cipher","text":"Encryption algorithm
"},{"location":"designdeck/#confidentiality","title":"Confidentiality","text":"Process of protecting information from being accessed by unauthorized parties
Mainly achieved via encryption
"},{"location":"designdeck/#integrity","title":"Integrity","text":"The process of preserving the accuracy and completeness of data over its entire lifecycle, so that they cannot be modified in an unauthorized or undetected manner
"},{"location":"designdeck/#mutual-tls","title":"Mutual TLS","text":"Add client authentication using a certificate
"},{"location":"designdeck/#oauth-2","title":"OAuth 2","text":"Standard for access delegation
Process: - Client gets a token from an authorization server - Makes a request to a server using the token - Server validates the token against the authorization server
Notes: some token types like JWT are self-contained, meaning the validation can be done by the server without a call to the authorization server
"},{"location":"designdeck/#public-key-infrastructure-pki","title":"Public key infrastructure (PKI)","text":"System for managing, storing, and distributing certificates
Relies on certificate revocation lists (CRLs)
"},{"location":"designdeck/#tls-handshake","title":"TLS handshake","text":"With mutual TLS:
One way: the session key is generated by the client
"},{"location":"designdeck/#two-main-uses-of-encryption","title":"Two main uses of encryption","text":"Encryption in transit
Encryption at rest
"},{"location":"designdeck/#two-types-of-encryption","title":"Two types of encryption","text":"Symmetric: key is shared between a client and a server (faster)
Asymmetric: two keys are used, a private and a public one - Client encrypts a message with the public key - Server decrypts the message with its private key
"},{"location":"designdeck/#what-does-digital-signature-provide","title":"What does digital signature provide","text":"Integrity and authentication
"},{"location":"designdeck/#what-does-tls-provide","title":"What does TLS provide?","text":"Check the Anki version here.
"},{"location":"#array","title":"Array","text":""},{"location":"#algorithm-to-reverse-an-array","title":"Algorithm to reverse an array","text":"int i = 0;\nint j = a.length - 1;\nwhile (i < j) {\n swap(a, i++, j--);\n}\n
"},{"location":"#array-complexity-access-search-insert-delete","title":"Array complexity: access, search, insert, delete","text":"Access: O(1)
Search: O(n)
Insert: O(n)
Delete: O(n)
"},{"location":"#binary-search-in-a-sorted-array-algorithm","title":"Binary search in a sorted array algorithm","text":"int lo = 0, hi = a.length - 1;\n\nwhile (lo <= hi) {\n int mid = lo + ((hi - lo) / 2);\n if (a[mid] == key) {\n return mid;\n }\n if (a[mid] < key) {\n lo = mid + 1;\n } else {\n hi = mid - 1;\n }\n}\n
"},{"location":"#further-reading","title":"Further Reading","text":"Solution: binary search
Check first if the array is rotated. If not, apply normal binary search
If rotated, find pivot (smallest element, only element whose previous is bigger)
Then, check if the element is in 0..pivot-1 or pivot..len-1
int findElementRotatedArray(int[] a, int val) {\n // If array not rotated\n if (a[0] < a[a.length - 1]) {\n // We apply the normal binary search\n return binarySearch(a, val, 0, a.length - 1);\n }\n\n int pivot = findPivot(a);\n\n if (val >= a[0] && val <= a[pivot - 1]) {\n // Element is before the pivot\n return binarySearch(a, val, 0, pivot - 1);\n } else if (val >= a[pivot] && val < a.length - 1) {\n // Element is after the pivot\n return binarySearch(a, val, pivot, a.length - 1);\n }\n return -1;\n}\n
"},{"location":"#given-an-array-move-all-the-0-to-the-left-while-maintaining-the-order-of-the-other-elements","title":"Given an array, move all the 0 to the left while maintaining the order of the other elements","text":"Example: 1, 0, 2, 0, 3, 0 => 0, 0, 0, 1, 2, 3
Two pointers technique: read and write starting at the end of the array
If read is on a 0, decrement read. Otherwise swap, decrement both
public void move(int[] a) {\n int w = a.length - 1, r = a.length - 1;\n while (r >= 0) {\n if (a[r] == 0) {\n r--;\n } else {\n swap(a, r--, w--);\n }\n }\n}\n
Time complexity: O(n)
Space complexity: O(1)
"},{"location":"#how-to-detect-if-an-element-is-a-pivot-in-a-rotated-sorted-array","title":"How to detect if an element is a pivot in a rotated sorted array","text":"Only element whose previous is bigger (also the pivot is the smallest element)
"},{"location":"#how-to-find-a-pivot-element-in-a-rotated-array","title":"How to find a pivot element in a rotated array","text":"Check first if the array is rotated
Then, apply binary search (comparison with a[right] to know if we go left or right)
int findPivot(int[] a) {\n int left = 0, right = a.length - 1;\n\n // Array is not rotated\n if (a[left] < a[right]) {\n return -1;\n }\n\n while (left <= right) {\n int mid = left + ((right - left) / 2);\n if (mid > 0 && a[mid] < a[mid - 1]) {\n return a[mid];\n }\n\n if (a[mid] < a[right]) {\n // Pivot is on the left\n right = mid - 1;\n } else {\n // Pivot is on the right\n left = mid + 1;\n }\n }\n\n return -1;\n}\n
"},{"location":"#how-to-find-the-duplicates-in-an-array","title":"How to find the duplicates in an array","text":"When full, create a new array of twice the size, copy items (System.arraycopy is optimized for that)
Shrink: - Not when one-half full (otherwise the worst case is too expensive: double-shrink-double-shrink, etc.) - Solution: shrink when one-quarter full
"},{"location":"#how-to-test-if-the-array-is-sorted-in-ascending-or-descending-order","title":"How to test if the array is sorted in ascending or descending order","text":"Test first and last element (no iteration)
"},{"location":"#rotate-an-array-by-n-elements-n-can-be-negative","title":"Rotate an array by n elements (n can be negative)","text":"Example: 1, 2, 3, 4, 5 with n = 3 => 3, 4, 5, 1, 2
void rotateArray(List<Integer> a, int n) {\n if (n < 0) {\n n = a.size() + n;\n }\n\n reverse(a, 0, a.size() - 1);\n reverse(a, 0, n - 1);\n reverse(a, n, a.size() - 1);\n}\n
Time complexity: O(n)
Memory complexity: O(1)
"},{"location":"#bit","title":"Bit","text":""},{"location":"#operator","title":"& operator","text":"AND bit by bit
"},{"location":"#operator_1","title":"<< operator","text":"Shift on the left
n * 2 <=> left shift by 1
n * 4 <=> left shift by 2
"},{"location":"#operator_2","title":">> operator","text":"Shift on the right
"},{"location":"#operator_3","title":">>> operator","text":"Logical shift (shift the sign bit as well)
"},{"location":"#operator_4","title":"^ operator","text":"XOR bit by bit
"},{"location":"#bit-vector-structure","title":"Bit vector structure","text":"Vector (linear sequence of numeric values stored contiguously in memory) in which each element is a bit (so either 0 or 1)
"},{"location":"#check-exactly-one-bit-is-set","title":"Check exactly one bit is set","text":"boolean checkExactlyOneBitSet(int num) {\n return num != 0 && (num & (num - 1)) == 0;\n}\n
"},{"location":"#clear-bits-from-i-to-0","title":"Clear bits from i to 0","text":"int clearBitsFromITo0(int num, int i) {\n int mask = (-1 << (i + 1));\n return num & mask;\n}\n
"},{"location":"#clear-bits-from-most-significant-one-to-i","title":"Clear bits from most significant one to i","text":"int clearBitsFromMsbToI(int num, int i) {\n int mask = (1 << i) - 1;\n return num & mask;\n}\n
"},{"location":"#clear-ith-bit","title":"Clear ith bit","text":"int clearBit(final int num, final int i) {\n final int mask = ~(1 << i);\n return num & mask;\n}\n
"},{"location":"#flip-ith-bit","title":"Flip ith bit","text":"int flipBit(final int num, final int i) {\n return num ^ (1 << i);\n}\n
"},{"location":"#get-ith-bit","title":"Get ith bit","text":"boolean getBit(final int num, final int i) {\n return ((num & (1 << i)) != 0);\n}\n
"},{"location":"#how-to-flip-one-bit","title":"How to flip one bit","text":"b ^ 1
"},{"location":"#how-to-represent-signed-integers","title":"How to represent signed integers","text":"Use the most significative bit to represent the sign. Yet, it is not enough (problem with this technique: 5 + (-5) != 0)
Two's complement technique: take the one complement and add one
-3: 1101
-2: 1110
-1: 1111
0: 0000
1: 0001
2: 0010
3: 0011
The most significant bit still represents the sign
Max integer value: 1...1 (31 bits)
-1: 1...1 (32 bits)
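A quick check in Java (Integer.toBinaryString prints the raw 32-bit pattern):
int n = 3;\nint neg = ~n + 1; // One's complement plus one\nSystem.out.println(neg); // -3\nSystem.out.println(Integer.toBinaryString(neg)); // 11111111111111111111111111111101\n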
"},{"location":"#set-ith-bit","title":"Set ith bit","text":"int setBit(final int num, final int i) {\n return num | (1 << i);\n}\n
"},{"location":"#update-a-bit-from-a-given-value","title":"Update a bit from a given value","text":"int updateBit(int num, int i, boolean bit) {\n int value = bit ? 1 : 0;\n int mask = ~(1 << i);\n return (num & mask) | (value << i);\n}\n
"},{"location":"#x-0s","title":"x & 0s","text":"0
"},{"location":"#x-1s","title":"x & 1s","text":"x
"},{"location":"#x-x","title":"x & x","text":"x
"},{"location":"#x-0s_1","title":"x ^ 0s","text":"x
"},{"location":"#x-1s_1","title":"x ^ 1s","text":"~x
"},{"location":"#x-x_1","title":"x ^ x","text":"0
"},{"location":"#x-0s_2","title":"x | 0s","text":"x
"},{"location":"#x-1s_2","title":"x | 1s","text":"1s
"},{"location":"#x-x_2","title":"x | x","text":"x
"},{"location":"#xor-operations","title":"XOR operations","text":"0 ^ 0 = 0
1 ^ 0 = 1
0 ^ 1 = 1
1 ^ 1 = 0
n XOR 0 => keep
n XOR 1 => flip
"},{"location":"#operator_5","title":"| operator","text":"OR bit by bit
"},{"location":"#operator_6","title":"~ operator","text":"Complement bit by bit
"},{"location":"#complexity","title":"Complexity","text":"Big-O Cheat Sheet
"},{"location":"#01-knapsack-brute-force-complexity","title":"0/1 Knapsack brute force complexity","text":"Time complexity: O(2^n) with n the number of items
Space complexity: O(n)
"},{"location":"#01-knapsack-memoization-complexity","title":"0/1 Knapsack memoization complexity","text":"Time and space complexity: O(n * c) with n the number items and c the capacity
"},{"location":"#01-knapsack-tabulation-complexity","title":"0/1 Knapsack tabulation complexity","text":"Time and space complexity: O(n * c) with n the number of items and c the capacity
Space complexity could even be improved to O(2*c) = O(c) as we need to store only the last 2 lines (using row%2):
int[][] dp = new int[2][c + 1];\n
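A sketch of the full tabulation with the two rolling rows (weights w, values v, and capacity c are the assumed inputs):
int knapsack(int[] w, int[] v, int c) {\n int n = w.length;\n int[][] dp = new int[2][c + 1];\n for (int i = 1; i <= n; i++) {\n for (int j = 0; j <= c; j++) {\n // Either skip item i or take it (if it fits)\n dp[i % 2][j] = dp[(i - 1) % 2][j];\n if (w[i - 1] <= j) {\n dp[i % 2][j] = Math.max(dp[i % 2][j], v[i - 1] + dp[(i - 1) % 2][j - w[i - 1]]);\n }\n }\n }\n return dp[n % 2][c];\n}\n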
"},{"location":"#amortized-complexity-definition","title":"Amortized complexity definition","text":"How much of a resource (time or memory) it takes to execute per operation on average
"},{"location":"#array-complexity-access-search-insert-delete_1","title":"Array complexity: access, search, insert, delete","text":"Access: O(1)
Search: O(n)
Insert: O(n)
Delete: O(n)
"},{"location":"#b-tree-complexity-access-insert-delete","title":"B-tree complexity: access, insert, delete","text":"All: O(log n)
"},{"location":"#bfs-and-dfs-graph-traversal-time-and-space-complexity","title":"BFS and DFS graph traversal time and space complexity","text":"Time: O(v + e) with v the number of vertices and e the number of edges
Space: O(v)
"},{"location":"#bfs-and-dfs-tree-traversal-time-and-space-complexity","title":"BFS and DFS tree traversal time and space complexity","text":"BFS: time O(v), space O(v)
DFS: time O(v), space O(h) (height of the tree)
"},{"location":"#big-o","title":"Big O","text":"Upper bound
"},{"location":"#big-omega","title":"Big Omega","text":"Lower bound (fastest)
"},{"location":"#big-theta","title":"Big Theta","text":"Theta(n) if both O(n) and Omega(n)
"},{"location":"#binary-heap-min-heap-or-max-heap-complexity-insert-get-min-max-delete-min-max","title":"Binary heap (min-heap or max-heap) complexity: insert, get min (max), delete min (max)","text":"Insert: O(log (n))
Get min (max): O(1)
Delete min: O(log n)
If not balanced O(n)
If balanced O(log n)
"},{"location":"#bst-delete-algo-and-complexity","title":"BST delete algo and complexity","text":"Find inorder successor and swap it
Average: O(log n)
Worst: O(h) if not self-balanced BST, otherwise O(log n)
"},{"location":"#bubble-sort-complexity-and-stability","title":"Bubble sort complexity and stability","text":"Time: O(n\u00b2)
Space: O(1)
Stable
"},{"location":"#complexity-of-a-function-making-multiple-recursive-subcalls","title":"Complexity of a function making multiple recursive subcalls","text":"Time: O(branches^depth) with branches the number of times each recursive call branches (english: 2 power 3)
Space: O(depth) to store the call stack
"},{"location":"#complexity-to-create-a-trie","title":"Complexity to create a trie","text":"Time and space: O(n * l) with n the number of words and l the longest word length
"},{"location":"#complexity-to-insert-a-key-in-a-trie","title":"Complexity to insert a key in a trie","text":"Time: O(k) with k the size of the key
Space: O(1) iterative, O(k) recursive
"},{"location":"#complexity-to-search-for-a-key-in-a-trie","title":"Complexity to search for a key in a trie","text":"Time: O(k) with k the size of the key
Space: O(1) iterative or O(k) recursive
"},{"location":"#counting-sort-complexity-stability-use-case","title":"Counting sort complexity, stability, use case","text":"Time complexity: O(n + k) // n is the number of elements, k is the range (the maximum element)
Space complexity: O(k)
Stable
Use case: known and small range of possible integers
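A minimal sketch for plain ints in 0..k (a stable variant for records would compute prefix sums over the counts instead):
int[] countingSort(int[] a, int k) {\n int[] count = new int[k + 1];\n for (int value : a) {\n count[value]++; // O(n) counting pass\n }\n int[] out = new int[a.length];\n int idx = 0;\n for (int value = 0; value <= k; value++) {\n for (int c = 0; c < count[value]; c++) {\n out[idx++] = value; // O(n + k) write-back pass\n }\n }\n return out;\n}\n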
"},{"location":"#doubly-linked-list-complexity-access-insert-delete","title":"Doubly linked list complexity: access, insert, delete","text":"Access: O(n)
Insert: O(1)
Delete: O(1)
"},{"location":"#hash-table-complexity-search-insert-delete","title":"Hash table complexity: search, insert, delete","text":"All: amortized O(1), worst O(n)
"},{"location":"#heapsort-complexity-stability-use-case","title":"Heapsort complexity, stability, use case","text":"Time: Theta(n log n)
Space: O(1)
Unstable
Use case: space constrained environment with O(n log n) time guarantee
Yet, not stable and not cache friendly
"},{"location":"#insertion-sort-complexity-stability-use-case","title":"Insertion sort complexity, stability, use case","text":"Time: O(n\u00b2)
Space: O(1)
Stable
Use case: partially sorted structure
"},{"location":"#linked-list-complexity-access-insert-delete","title":"Linked list complexity: access, insert, delete","text":"Access: O(n)
Insert: O(1)
Delete: O(1)
"},{"location":"#mergesort-complexity-stability-use-case","title":"Mergesort complexity, stability, use case","text":"Time: Theta(n log n)
Space: O(n)
Stable
Use case: good worst case time complexity and stable, good with linked list
"},{"location":"#quicksort-complexity-stability-use-case","title":"Quicksort complexity, stability, use case","text":"Time: best and average O(n log n), worst O(n\u00b2) if the array is already sorted in ascending or descending order
Space: O(log n) // In-place sorting algorithm
Not stable
Use case: in practice, quicksort is often faster than merge sort due to better locality (not applicable with linked list so in this case we prefer mergesort)
"},{"location":"#radix-sort-complexity-stability-use-case","title":"Radix sort complexity, stability, use case","text":"Time complexity: O(nk) // n is the number of elements, k is the maximum number of digits for a number
Space complexity: O(k)
Stable
Use case: if k < log(n) (for example 1M of elements from 0..1000 as 4 < log(1M))
"},{"location":"#recursivity-impacts-on-algorithm-complexity","title":"Recursivity impacts on algorithm complexity","text":"Space impact as each call is added to the call stack
Unless we use tail call recursion
"},{"location":"#red-black-tree-complexity-access-insert-delete","title":"Red-black tree complexity: access, insert, delete","text":"All: O(log n)
"},{"location":"#selection-sort-complexity","title":"Selection sort complexity","text":"Time: Theta(n\u00b2)
Space: O(1)
"},{"location":"#stack-implementations-and-insertdelete-complexity","title":"Stack implementations and insert/delete complexity","text":"Insert: O(1)
Delete: O(1)
Insert: O(n), amortized time O(1)
Delete: O(1)
"},{"location":"#time-complexity-to-build-a-binary-heap","title":"Time complexity to build a binary heap","text":"O(n)
Time and space: O(v + e)
"},{"location":"#dynamic-programming","title":"Dynamic Programming","text":""},{"location":"#dynamic-programming-concept","title":"Dynamic programming concept","text":"Break down a problem in smaller parts and store the results of these subproblems so that they only need to be computed once
A DP algorithm will search through all of the possible subproblems (main difference with greedy algorithms)
Based on either: - Memoization (top-down) - Tabulation (bottom-up)
"},{"location":"#memoization-vs-tabulation","title":"Memoization vs tabulation","text":"Optimization technique to cache previously computed results
Used by dynamic programming algorithms
Memoization: top-down (start with a large, complex problem and break it down into smaller sub-problems)
f(x) {\n if (x <= 1) return x // Base cases (Fibonacci), otherwise the recursion never terminates\n if (mem[x] is undefined)\n mem[x] = f(x-1) + f(x-2)\n return mem[x]\n}\n
Tabulation: bottom-up (start with the smallest solution and then build up each solution until we arrive at the solution to the initial problem)
tabFib(n) {\n mem[0] = 0\n mem[1] = 1\n for i = 2...n\n mem[i] = mem[i-2] + mem[i-1]\n return mem[n]\n}\n
"},{"location":"#encoding","title":"Encoding","text":""},{"location":"#ascii-charset","title":"ASCII charset","text":"128 characters
"},{"location":"#difference-encodingcharset","title":"Difference encoding/charset","text":"Charset: set of characters to be used (e.g. ASCII 128 characters)
Encoding: translation of a list of characters in binary
Encoding is used because we can't guarantee 1 character = 1 byte for every charset
Example: UTF-8 encodes Unicode characters using from 1 byte (English) up to 4 bytes
"},{"location":"#unicode-charset","title":"Unicode charset","text":"Superset of ASCII with 2^21 characters
"},{"location":"#general","title":"General","text":""},{"location":"#before-finding-a-solution","title":"Before finding a solution","text":"1) Make sure to understand the problem by listing: - Inputs - Outputs (what do we search) - Constraints
2) Draw examples
"},{"location":"#comparator-implementation-to-order-two-integers","title":"Comparator implementation to order two integers","text":"Ordering, min-heap: (a, b) -> a - b
Reverse ordering, max-heap: (a, b) -> b - a
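Usage sketch (note: (a, b) -> a - b can overflow on extreme values; Integer::compare is the safe equivalent):
PriorityQueue<Integer> minHeap = new PriorityQueue<>((a, b) -> a - b);\nPriorityQueue<Integer> maxHeap = new PriorityQueue<>((a, b) -> b - a);\nminHeap.add(3); minHeap.add(1); minHeap.add(2);\nSystem.out.println(minHeap.poll()); // 1\n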
7 ways: 1. a and b do not overlap 2. a and b overlap, b ends after a 3. a completely overlaps b 4. a and b overlap, a ends after b 5. b completely overlaps a 6. b and a do not overlap 7. a and b are equal
"},{"location":"#different-ways-for-two-intervals-to-relate-to-each-other-if-ordered-by-start-then-end","title":"Different ways for two intervals to relate to each other if ordered by start then end","text":"2 different ways: - No overlap - Overlap // Merge intervals (start of the first interval, max of the two ends)
"},{"location":"#divide-and-conquer-algorithm-paradigm","title":"Divide and conquer algorithm paradigm","text":"Example with merge sort: 1. Split the array into two halves 2. Sort them (recursive call) 3. Merge the two halves
"},{"location":"#how-to-name-a-matrix-indexes","title":"How to name a matrix indexes","text":"Use m[row][col] instead of m[y][x]
"},{"location":"#if-stucked-on-a-problem","title":"If stucked on a problem","text":"Mutates an input
"},{"location":"#p-vs-np-problems","title":"P vs NP problems","text":"P (polynomial): set of problems that can be solved reasonably fast (example: multiplication, sorting, etc.)
Complexity is not exponential
NP (non-deterministic polynomial): set of problems where, given a solution, we can test if it is a correct one in a reasonable amount of time, but finding the solution is not fast (example: a 1M*1M sudoku grid, the traveling salesman problem, etc.)
NP-complete: hardest problems in the NP set
There are other sets of problems that are neither P nor NP, as an answer is really hard to prove (example: best move in a chess game)
P = NP asks: does being able to quickly recognize correct answers mean there's also a quick way to find them?
"},{"location":"#solving-optimization-problems","title":"Solving optimization problems","text":"Preserve the original order of elements with equal key
"},{"location":"#what-do-to-after-having-designed-a-solution","title":"What do to after having designed a solution","text":"Testing on nominal cases then edge cases
Time and space complexity
"},{"location":"#graph","title":"Graph","text":""},{"location":"#a-algorithm","title":"A* algorithm","text":"Complete solution to find the shortest path to a target node
Algorithm: - Put the initial state in a priority queue - While the priority queue is not empty: poll an element and insert all its neighbours - If the target is reached, update a min variable
Priority is computed using the evaluation function: f(n) = h + g where h is a heuristic (local cost to visit a node) and g is the cost so far (length of the path so far)
"},{"location":"#backedge-definition","title":"Backedge definition","text":"An edge from a node to itself or to an ancestor
"},{"location":"#best-first-search-algorithm","title":"Best-first search algorithm","text":"Greedy solution (non-complete) to find the shortest path to a target node
Algorithm: - Put the initial state in a priority queue - While the target is not reached: poll an element and insert all its neighbours
Priority is computed using the evaluation function: f(n) = h where h is a heuristic (local cost to visit a node)
"},{"location":"#bfs-dfs-graph-traversal-use-cases","title":"BFS & DFS graph traversal use cases","text":"BFS: shortest path
DFS: does a path exist, does a cycle exist (memo: D for Does)
DFS stores a single path at a time, requires less memory than BFS (on average but same space complexity)
"},{"location":"#bfs-and-dfs-graph-traversal-time-and-space-complexity_1","title":"BFS and DFS graph traversal time and space complexity","text":"Time: O(v + e) with v the number of vertices and e the number of edges
Space: O(v)
"},{"location":"#bidirectional-search","title":"Bidirectional search","text":"Run two simultaneous BFS, one from the source, one from the target
Once their searches collide, we found a path
If the branching factor of a tree is b and the distance to the target vertex is d, then the normal BFS/DFS searching time complexity would be O(b^d)
Here it is O(b^(d/2))
"},{"location":"#connected-graph-definition","title":"Connected graph definition","text":"If there is a path between every pair of vertices, the graph is called connected
Otherwise, the graph consists of multiple isolated subgraphs
"},{"location":"#difference-best-first-search-and-a-algorithms","title":"Difference Best-first search and A* algorithms","text":"Best-first search is a greedy solution: not complete // a solution can be not optimal
A*: complete
"},{"location":"#dijkstra-algorithm","title":"Dijkstra algorithm","text":"Input: graph, initial vertex
Output: for each vertex: shortest path and previous node // The previous node is the one we are coming from in the shortest path. To find the shortest path between two nodes, we need to iterate backwards. Example: A -> C => E, D, A
Algorithm: - Init the shortest distance to MAX except for the initial node - Init a priority queue where the comparator will be on the total distance so far - Init a set to store all visited node - Add initial vertex to the priority queue - While queue is not empty: Poll a vertex (mark it visited) and check the total distance to each neighbour (current distance + distance so far), update shortest and previous arrays if smaller. If destination was unvisited, adds it to the queue
void dijkstra(GraphAjdacencyMatrix graph, int initial) {\n Set<Integer> visited = new HashSet<>();\n\n int n = graph.vertex;\n int[] shortest = new int[n];\n int[] previous = new int[n];\n for (int i = 0; i < n; i++) {\n if (i != initial) {\n shortest[i] = Integer.MAX_VALUE;\n }\n }\n\n // Entry: key=vertex, value=distance so far\n PriorityQueue<Entry> minHeap = new PriorityQueue<>((e1, e2) -> e1.value - e2.value);\n minHeap.add(new Entry(initial, 0));\n\n while (!minHeap.isEmpty()) {\n Entry current = minHeap.poll();\n int source = current.key;\n int distanceSoFar = current.value;\n\n // Get neighbours\n List<GraphAjdacencyMatrix.Edge> edges = graph.getEdge(source);\n\n for (GraphAjdacencyMatrix.Edge edge : edges) {\n // For each neighbour, check the total distance\n int distance = distanceSoFar + edge.distance;\n if (distance < shortest[edge.destination]) {\n shortest[edge.destination] = distance;\n previous[edge.destination] = source;\n }\n\n // Add the element in the queue if not visited\n if (!visited.contains(edge.destination)) {\n minHeap.add(new Entry(edge.destination, distance));\n }\n }\n\n visited.add(source);\n }\n\n print(shortest);\n print(previous);\n}\n
"},{"location":"#dynamic-connectivity-problem","title":"Dynamic connectivity problem","text":"Given a set of nodes and edges: are two nodes connected (directly or in-directly)?
Two methods: - union(2, 5) // connect object 2 with object 5 - connected(1 , 6) // is object 1 connected to object 6?
"},{"location":"#further-reading_1","title":"Further Reading","text":"Array of integer of size N initialized with their index (0: 0, 1: 1 etc.).
If two indexes have the same value, they belong to the same group.
Init: integer array of size N
Interpretation: id[i] is the parent of i; i is a root if id[i] == i
Modify quick-union to avoid tall trees
Keep track of the size of each tree (number of nodes): extra array size[i] to count number of objects in the tree rooted at i
O(n) extra space
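A sketch of weighted quick-union putting these pieces together (path compression left out for brevity):
class UnionFind {\n private final int[] id;\n private final int[] size;\n\n UnionFind(int n) {\n id = new int[n];\n size = new int[n];\n for (int i = 0; i < n; i++) { id[i] = i; size[i] = 1; }\n }\n\n private int root(int i) {\n while (id[i] != i) i = id[i]; // Follow parents up to the root\n return i;\n }\n\n boolean connected(int p, int q) { return root(p) == root(q); }\n\n void union(int p, int q) {\n int rp = root(p), rq = root(q);\n if (rp == rq) return;\n // Attach the smaller tree under the bigger one to avoid tall trees\n if (size[rp] < size[rq]) { id[rp] = rq; size[rq] += size[rp]; }\n else { id[rq] = rp; size[rp] += size[rq]; }\n }\n}\n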
Solution: topological sort
If there's a cycle in the relations, it means it is not possible to schedule all the tasks
There is a cycle if the produced sorted array size is different from n
"},{"location":"#graph-definition","title":"Graph definition","text":"A way to represent a network, or a collection of inteconnected objects
G = (V, E) with V a set of vertices (or nodes) and E a set of edges (or links)
"},{"location":"#graph-traversal-bfs","title":"Graph traversal: BFS","text":"Traverse broad into the graph by visiting the sibling/neighbor before children nodes (one level of children at a time)
Iterative using a queue
Algorithm: similar with tree except we need to mark the visited nodes, can start with any nodes
Queue<Node> queue = new LinkedList<>();\nNode first = graph.nodes.get(0);\nqueue.add(first);\nfirst.markVisited();\n\nwhile (!queue.isEmpty()) {\n Node node = queue.poll();\n System.out.println(node.name);\n\n for (Edge edge : node.connections) {\n if (!edge.end.visited) {\n queue.add(edge.end);\n edge.end.markVisited();\n }\n }\n}\n
"},{"location":"#graph-traversal-dfs","title":"Graph traversal: DFS","text":"Traverse deep into the graph by visiting the children before sibling/neighbor nodes (traverse down one single path)
Walk through a path, backtrack until we found a new path
Algorithm: recursive, or iterative using a stack (same algo as BFS except we use a stack instead of a queue)
"},{"location":"#how-to-compute-the-shortest-path-between-two-nodes-in-an-unweighted-graph","title":"How to compute the shortest path between two nodes in an unweighted graph","text":"BFS traversal by using an array to keep track of the min distance distances[i] gives the shortest distance between the input node and the node of id i
Algorithm: no need to keep track of the visited node, it is replaced by a test on the distance array
Queue<Node> queue = new LinkedList<>();\nqueue.add(parent);\nint[] distances = new int[graph.nodes.size()];\nArrays.fill(distances, -1);\ndistances[parent.id] = 0;\n\nwhile (!queue.isEmpty()) {\n Node node = queue.poll();\n for (Edge edge : node.connections) {\n if (distances[edge.end.id] == -1) {\n queue.add(edge.end);\n distances[edge.end.id] = distances[node.id] + 1;\n }\n }\n}\n
"},{"location":"#how-to-detect-a-cycle-in-a-directed-graph","title":"How to detect a cycle in a directed graph","text":"Using DFS by marking the visited nodes, there is a cycle if a visited node is also part of the current stack
The stack can be managed as a boolean array
boolean isCyclic(DirectedGraph g) {\n boolean[] visited = new boolean[g.size()];\n boolean[] stack = new boolean[g.size()];\n\n for (int i = 0; i < g.size(); i++) {\n if (isCyclic(g, i, visited, stack)) {\n return true;\n }\n }\n return false;\n}\n\nboolean isCyclic(DirectedGraph g, int node, boolean[] visited, boolean[] stack) {\n if (stack[node]) {\n return true;\n }\n\n if (visited[node]) {\n return false;\n }\n\n stack[node] = true;\n visited[node] = true;\n\n List<DirectedGraph.Edge> edges = g.getEdges(node);\n for (DirectedGraph.Edge edge : edges) {\n int destination = edge.destination;\n if (isCyclic(g, destination, visited, stack)) {\n return true;\n }\n }\n\n // Backtrack\n stack[node] = false;\n\n return false;\n}\n
"},{"location":"#how-to-detect-a-cycle-in-an-undirected-graph","title":"How to detect a cycle in an undirected graph","text":"Using DFS
Idea: for every visited vertex v, if there is an adjacent u such that u is already visited and u is not the parent of v, then there is a cycle
public boolean isCyclic(UndirectedGraph g) {\n boolean[] visited = new boolean[g.size()];\n for (int i = 0; i < g.size(); i++) {\n if (!visited[i]) {\n if (isCyclic(g, i, visited, -1)) {\n return true;\n }\n }\n }\n return false;\n}\n\nprivate boolean isCyclic(UndirectedGraph g, int v, boolean[] visited, int parent) {\n visited[v] = true;\n\n List<UndirectedGraph.Edge> edges = g.getEdges(v);\n for (UndirectedGraph.Edge edge : edges) {\n if (!visited[edge.destination]) {\n if (isCyclic(g, edge.destination, visited, v)) {\n return true;\n }\n } else if (edge.destination != parent) {\n return true;\n }\n }\n return false;\n}\n
"},{"location":"#how-to-name-a-graph-with-directed-edges-and-without-cycle","title":"How to name a graph with directed edges and without cycle","text":"Directed Acyclic Graph (DAG)
"},{"location":"#how-to-name-a-graph-with-few-edges-and-with-many-edges","title":"How to name a graph with few edges and with many edges","text":"Sparse: few edges
Dense: many edges
"},{"location":"#how-to-name-the-number-of-edges","title":"How to name the number of edges","text":"Degree of a vertex
"},{"location":"#how-to-represent-the-edges-of-a-graph-structure-and-complexity","title":"How to represent the edges of a graph (structure and complexity)","text":"Using an adjacency matrix: two-dimensional array of boolean with a[i][j] is true if there is an edge between node i and j
Time complexity: O(1)
Problem: - If the graph is undirected: half of the space is useless - If the graph is sparse, we still have to consume O(v²) space
Using an adjacency list: array (or map) of linked list with a[i] represents the edges for the node i
Time complexity: O(d) with d the degree of a vertex
Time and space: O(v + e)
"},{"location":"#topological-sort-technique","title":"Topological sort technique","text":"If there is an edge from U to V, then U <= V
Possible only if the graph is a DAG
Algo: - Create a graph representation (adjacency list) and an in degree counter (Map) - Zero them for each vertex - Fill the adjacency list and the in degree counter for each edge - Add in a queue each vertex whose in degree count is 0 (source vertex with no parent) - While the queue is not empty, poll a vertex from it then decrement the in degree of its children (no removal)
To check if there is a cycle, we must compare the size of the produced array to the number of vertices
List<Integer> sort(int vertices, int[][] edges) {\n if (vertices == 0) {\n return Collections.EMPTY_LIST;\n }\n\n List<Integer> sorted = new ArrayList<>(vertices);\n // Adjacency list graph\n Map<Integer, List<Integer>> graph = new HashMap<>();\n // Count of incoming edges for each vertex\n Map<Integer, Integer> inDegree = new HashMap<>();\n\n for (int i = 0; i < vertices; i++) {\n inDegree.put(i, 0);\n graph.put(i, new LinkedList<>());\n }\n\n // Init graph and inDegree\n for (int[] edge : edges) {\n int parent = edge[0];\n int child = edge[1];\n\n graph.get(parent).add(child);\n inDegree.put(child, inDegree.get(child) + 1);\n }\n\n // Create a source queue and add each source (a vertex whose inDegree count is 0)\n Queue<Integer> sources = new LinkedList<>();\n for (Map.Entry<Integer, Integer> entry : inDegree.entrySet()) {\n if (entry.getValue() == 0) {\n sources.add(entry.getKey());\n }\n }\n\n while (!sources.isEmpty()) {\n int vertex = sources.poll();\n sorted.add(vertex);\n\n // For each vertex, we will decrease the inDegree count of its children\n List<Integer> children = graph.get(vertex);\n for (int child : children) {\n inDegree.put(child, inDegree.get(child) - 1);\n if (inDegree.get(child) == 0) {\n sources.add(child);\n }\n }\n }\n\n // Topological sort is not possible as the graph has a cycle\n if (sorted.size() != vertices) {\n return new ArrayList<>();\n }\n\n return sorted;\n}\n
"},{"location":"#travelling-salesman-problem","title":"Travelling salesman problem","text":"Find the shortest possible route that visits every city (vertex) exactly once
Possible solutions: - Greedy: nearest neighbour - Dynamic programming: compute optimal solution for a path of length n by using information already known for partial tours of length n-1 (time complexity: n^2 * 2^n)
"},{"location":"#two-types-of-graphs","title":"Two types of graphs","text":"Directed graph (with directed edges)
Undirected graph (with undirected edges)
"},{"location":"#greedy","title":"Greedy","text":""},{"location":"#best-first-search-algorithm_1","title":"Best-first search algorithm","text":"Greedy solution (non-complete) to find the shortest path to a target node
Algorithm: - Put the initial state in a priority queue - While the target is not reached: poll an element and insert all its neighbours
Priority is computed using the evaluation function: f(n) = h(n) where h is a heuristic (local cost to visit a node)
"},{"location":"#greedy-algorithm","title":"Greedy algorithm","text":"Algorithm paradigm of making the locally optimal choice at each stage using a heuristic function
Making the locally optimal choice does not necessarily mean we cannot rely on a global context to take a decision
Never reconsider a choice (main difference with dynamic programming)
Solution found may not be the most optimal one
"},{"location":"#greedy-algorithm-structure","title":"Greedy algorithm: structure","text":"Often, the global context is spread into a priority queue
"},{"location":"#greedy-technique","title":"Greedy technique","text":"Identify an optimal subproblem or substructure in the problem and determine how to reach it
Focus on what you have now (don't think about what comes next)
We may want to apply the traversal technique to have a global context for the identification part (a map of letters/positions etc.)
"},{"location":"#technique-optimization-problems-requiring-a-min-or-max","title":"Technique - Optimization problems requiring a min or max","text":"Greedy technique
"},{"location":"#hash-table","title":"Hash Table","text":""},{"location":"#hash-table-complexity-search-insert-delete_1","title":"Hash table complexity: search, insert, delete","text":"All: amortized O(1), worst O(n)
"},{"location":"#hash-table-implementation","title":"Hash table implementation","text":"Resize the array when a threshold is reached
If extreme nonuniform distribution, could be replaced by array of BST
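A minimal separate-chaining sketch (simplified; as a reference point, Java's HashMap also resizes on a load factor threshold and converts long buckets into balanced trees):
class ChainedHashMap<K, V> {\n private static class Entry<K, V> {\n final K key;\n V value;\n Entry(K key, V value) { this.key = key; this.value = value; }\n }\n\n private List<Entry<K, V>>[] buckets = newBuckets(16);\n private int size;\n\n public V get(K key) {\n for (Entry<K, V> e : buckets[index(key, buckets.length)]) {\n if (e.key.equals(key)) {\n return e.value;\n }\n }\n return null;\n }\n\n public void put(K key, V value) {\n // Resize the array when the load factor threshold is reached\n if (size >= 0.75 * buckets.length) {\n resize(buckets.length * 2);\n }\n for (Entry<K, V> e : buckets[index(key, buckets.length)]) {\n if (e.key.equals(key)) {\n e.value = value;\n return;\n }\n }\n buckets[index(key, buckets.length)].add(new Entry<>(key, value));\n size++;\n }\n\n private void resize(int capacity) {\n List<Entry<K, V>>[] old = buckets;\n buckets = newBuckets(capacity);\n for (List<Entry<K, V>> bucket : old) {\n for (Entry<K, V> e : bucket) {\n buckets[index(e.key, capacity)].add(e);\n }\n }\n }\n\n private int index(K key, int capacity) {\n return (key.hashCode() & 0x7fffffff) % capacity;\n }\n\n private static <K, V> List<Entry<K, V>>[] newBuckets(int capacity) {\n List<Entry<K, V>>[] b = new List[capacity];\n for (int i = 0; i < capacity; i++) {\n b[i] = new LinkedList<>();\n }\n return b;\n }\n}\n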
"},{"location":"#heap","title":"Heap","text":""},{"location":"#binary-heap-min-heap-or-max-heap-complexity-insert-get-min-max-delete-min-max_1","title":"Binary heap (min-heap or max-heap) complexity: insert, get min (max), delete min (max)","text":"Insert: O(log (n))
Get min (max): O(1)
Delete min: O(log n)
"},{"location":"#binary-heap-min-heap-or-max-heap-data-structure-used-for-the-implementation","title":"Binary heap (min-heap or max-heap) data structure used for the implementation","text":"Using an array
For a node at index i: - Left child: 2 * i + 1 - Right child: 2 * i + 2 - Parent: (i - 1) / 2
"},{"location":"#binary-heap-min-heap-or-max-heap-definition","title":"Binary heap (min-heap or max-heap) definition","text":"A binary heap is a a complete binary tree with min-heap or max-heap property ordering. Also called min heap or max heap.
Min heap: each node smaller than its children, min value element at the root.
Two operations: insert(), getMin()
Difference with a BST: in a BST, smaller elements are on the left and greater elements on the right; in a heap, a smaller element can be found on either the left or the right side.
"},{"location":"#binary-heap-min-heap-or-max-heap-delete-min","title":"Binary heap (min-heap or max-heap) delete min","text":"Replace min element (root) with the last node (left-most, lowest-level node because a binary heap is a complete binary tree)
If violations, swap with the smallest child (level by level)
"},{"location":"#binary-heap-min-heap-or-max-heap-insert-algorithm","title":"Binary heap (min-heap or max-heap) insert algorithm","text":"Insert node at the end (left-most spot because a binary heap is a complete binary tree)
If violations, swap with parents until no more violation
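A minimal array-based min-heap sketch of the two operations above (simplified: fixed capacity, deleteMin assumes a non-empty heap):
class MinHeap {\n private final int[] a = new int[64]; // Simplified: fixed capacity\n private int size;\n\n public void insert(int value) {\n // Insert at the end then swap with the parent until no more violation\n a[size] = value;\n int i = size++;\n while (i > 0 && a[i] < a[(i - 1) / 2]) {\n swap(i, (i - 1) / 2);\n i = (i - 1) / 2;\n }\n }\n\n public int deleteMin() {\n int min = a[0];\n // Replace the root with the last node then swap with the smallest child\n a[0] = a[--size];\n int i = 0;\n while (true) {\n int smallest = i;\n int left = 2 * i + 1;\n int right = 2 * i + 2;\n if (left < size && a[left] < a[smallest]) {\n smallest = left;\n }\n if (right < size && a[right] < a[smallest]) {\n smallest = right;\n }\n if (smallest == i) {\n return min;\n }\n swap(i, smallest);\n i = smallest;\n }\n }\n\n private void swap(int i, int j) {\n int tmp = a[i];\n a[i] = a[j];\n a[j] = tmp;\n }\n}\n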
"},{"location":"#binary-heap-min-heap-or-max-heap-use-cases","title":"Binary heap (min-heap or max-heap) use-cases","text":"Priority queue
"},{"location":"#comparator-implementation-to-order-two-integers_1","title":"Comparator implementation to order two integers","text":"Ordering, min-heap: (a, b) -> a - b
Reverse ordering, max-heap: (a, b) -> b - a
"},{"location":"#convert-an-array-into-a-binary-heap-in-place","title":"Convert an array into a binary heap in place","text":"For i from 0 to n-1, swap recursively element a[i] until min/max heap violation on its node
"},{"location":"#find-the-median-of-a-stream-of-numbers-2-methods-insertint-and-int-findmedian","title":"Find the median of a stream of numbers, 2 methods insert(int) and int findMedian()","text":"Solution: two heap technique
Keep two heaps and maintain the balance by transferring an element from one heap to another if not balanced
Return the median (difference if even or odd)
// First half\nPriorityQueue<Integer> maxHeap = new PriorityQueue<>((a, b) -> b - a);\n// Second half\nPriorityQueue<Integer> minHeap = new PriorityQueue<>();\n\npublic void insertNum(int n) {\n // First element\n if (minHeap.isEmpty()) {\n minHeap.add(n);\n return;\n }\n\n // Insert into min or max heap\n Integer minSecondHalf = minHeap.peek();\n if (n >= minSecondHalf) {\n minHeap.add(n);\n } else {\n maxHeap.add(n);\n }\n\n // Is balanced?\n if (minHeap.size() > maxHeap.size() + 1) {\n maxHeap.add(minHeap.poll());\n } else if (maxHeap.size() > minHeap.size() + 1) {\n minHeap.add(maxHeap.poll());\n }\n}\n\npublic double findMedian() {\n // Even\n if (minHeap.size() == maxHeap.size()) {\n return (double) (minHeap.peek() + maxHeap.peek()) / 2;\n }\n\n // Odd\n if (minHeap.size() > maxHeap.size()) {\n return minHeap.peek();\n }\n return maxHeap.peek();\n}\n
"},{"location":"#given-an-unsorted-array-of-numbers-find-the-k-largest-numbers-in-it","title":"Given an unsorted array of numbers, find the K largest numbers in it","text":"Solution: using a min heap but we keep only K elements in it
public static List<Integer> findKLargestNumbers(int[] nums, int k) {\n PriorityQueue<Integer> minHeap = new PriorityQueue<>();\n\n // Put the first K numbers\n for (int i = 0; i < k; i++) {\n minHeap.add(nums[i]);\n }\n\n // Iterate on the rest of the array\n // Check whether the current element is bigger than the smallest one\n for (int i = k; i < nums.length; i++) {\n if (nums[i] > minHeap.peek()) {\n minHeap.poll();\n minHeap.add(nums[i]);\n }\n }\n\n return toList(minHeap);\n}\n\npublic static List<Integer> toList(PriorityQueue<Integer> minHeap) {\n List<Integer> list = new ArrayList<>(minHeap.size());\n while (!minHeap.isEmpty()) {\n list.add(minHeap.poll());\n }\n\n return list;\n}\n
Time complexity: O(n log k)
Space complexity: O(k)
"},{"location":"#heapsort-algorithm","title":"Heapsort algorithm","text":"Stable
"},{"location":"#time-complexity-to-build-a-binary-heap_1","title":"Time complexity to build a binary heap","text":"O(n)
"},{"location":"#two-heaps-technique","title":"Two heaps technique","text":"Keep two heaps: - A max heap for the first half - Then a min heap for the second half
May be required to balance them to have at most a difference in terms of size of 1
"},{"location":"#why-binary-heap-over-bst-for-priority-queue","title":"Why binary heap over BST for priority queue?","text":"BST needs an extra pointer to the min or max value (otherwise finding the min or max is O(log n))
Implemented using an array: faster in practice (better locality, more cache friendly)
Building a binary heap is O(n), instead of O(n log n) for a BST
"},{"location":"#linked-list","title":"Linked List","text":""},{"location":"#algorithm-to-reverse-a-linked-list","title":"Algorithm to reverse a linked list","text":"public ListNode reverse(ListNode head) {\n ListNode previous = null;\n ListNode current = head;\n\n while (current != null) {\n // Keep temporary next node\n ListNode next = current.next;\n // Change link\n current.next = previous;\n // Move previous and current\n previous = current;\n current = next;\n }\n\n return previous;\n}\n
"},{"location":"#doubly-linked-list","title":"Doubly linked list","text":"Each node contains a pointer to the previous and the next node
"},{"location":"#doubly-linked-list-complexity-access-insert-delete_1","title":"Doubly linked list complexity: access, insert, delete","text":"Access: O(n)
Insert: O(1)
Delete: O(1)
"},{"location":"#get-the-middle-of-a-linked-list","title":"Get the middle of a linked list","text":"Using the runner technique
"},{"location":"#iterate-over-two-linked-lists","title":"Iterate over two linked lists","text":"while (l1 != null || l2 != null) {\n\n}\n
"},{"location":"#linked-list-complexity-access-insert-delete_1","title":"Linked list complexity: access, insert, delete","text":"Access: O(n)
Insert: O(1)
Delete: O(1)
"},{"location":"#linked-list-questions-prerequisite","title":"Linked list questions prerequisite","text":"Single or doubly linked list?
"},{"location":"#queue-implementations-and-insertdelete-complexity","title":"Queue implementations and insert/delete complexity","text":"Insert: O(1)
Delete: O(1)
Insert: O(1)
Delete: O(1)
"},{"location":"#ring-buffer-or-circular-buffer-structure","title":"Ring buffer (or circular buffer) structure","text":"Data structure using a single, fixed-sized buffer as if it were connected end-to-end
"},{"location":"#what-if-we-need-to-iterate-backwards-on-a-singly-linked-list-in-constant-space-without-mutating-the-input","title":"What if we need to iterate backwards on a singly linked list in constant space without mutating the input?","text":"Reverse the liked list (or a subpart only), implement the algo then reverse it again to the initial state
"},{"location":"#math","title":"Math","text":""},{"location":"#a-a-property","title":"a = a property","text":"Reflexive
"},{"location":"#if-a-b-and-b-c-then-a-c-property","title":"If a = b and b = c then a = c property","text":"Transitive
"},{"location":"#if-a-b-then-b-a-property","title":"If a = b then b = a property","text":"Symmetric
"},{"location":"#logarithm-definition","title":"Logarithm definition","text":"Inverse function to exponentiation
"},{"location":"#median-definition","title":"Median definition","text":"If odd: middle value
If even: average of the two middle values (1, 2, 3, 4 => (2 + 3) / 2 = 2.5)
"},{"location":"#n-choose-k-problems","title":"n-choose-k problems","text":"From a set of n items, choose k items with 0 <= k <= n
Order matters: P(n, k) = n! / (n - k)! // How many permutations
Order does not matter: C(n, k) = n! / ((n - k)! k!) // How many combinations
Example: n = 4, k = 2 => P(4, 2) = 12 permutations, C(4, 2) = 6 combinations
"},{"location":"#probability-pa-b-inter","title":"Probability: P(a \u2229 b) // inter","text":"P(a \u2229 b) = P(a) * P(b)
"},{"location":"#probability-pa-b-union","title":"Probability: P(a \u222a b) // union","text":"P(a \u222a b) = P(a) + P(b) - P(a \u2229 b)
"},{"location":"#probability-pba-probability-of-a-knowing-b","title":"Probability: Pb(a) // probability of a knowing b","text":"Pb(a) = P(a \u2229 b) / P(b)
"},{"location":"#queue","title":"Queue","text":""},{"location":"#dequeue-data-structure","title":"Dequeue data structure","text":"Double ended queue for which elements can be added or removed from either the front (head) or the back (tail)
"},{"location":"#queue_1","title":"Queue","text":"FIFO (First In First Out)
"},{"location":"#queue-implementations-and-insertdelete-complexity_1","title":"Queue implementations and insert/delete complexity","text":"Insert: O(1)
Delete: O(1)
Insert: O(1)
Delete: O(1)
"},{"location":"#recursion","title":"Recursion","text":""},{"location":"#how-to-handle-a-recursive-function-that-need-to-return-a-list","title":"How to handle a recursive function that need to return a list","text":"Input: - Result List - Current iteration element
Output: void
void f(List<String> result, String current) {\n // Do something\n result.add(...);\n}\n
"},{"location":"#how-to-handle-a-recursive-function-that-need-to-return-a-maximum-value","title":"How to handle a recursive function that need to return a maximum value","text":"Implementation: return max(f(a), f(b))
"},{"location":"#loop-inside-of-a-recursive-function","title":"Loop inside of a recursive function?","text":"Might be a code smell. The iteration is already brought by the recursion itself.
"},{"location":"#sort","title":"Sort","text":""},{"location":"#bubble-sort-algorithm","title":"Bubble sort algorithm","text":"Walk through a collection and compares 2 elements at a time
If they are out of order, swap them
Continue until the entire collection is sorted
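A minimal sketch (assuming the same swap helper as the other snippets); the early exit stops as soon as a pass performs no swap:
void bubbleSort(int[] a) {\n for (int i = 0; i < a.length - 1; i++) {\n boolean swapped = false;\n // After each pass, the biggest remaining element is bubbled up to the end\n for (int j = 0; j < a.length - 1 - i; j++) {\n if (a[j] > a[j + 1]) {\n swap(a, j, j + 1);\n swapped = true;\n }\n }\n if (!swapped) {\n return; // Collection already sorted\n }\n }\n}\n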
"},{"location":"#bubble-sort-complexity-and-stability_1","title":"Bubble sort complexity and stability","text":"Time: O(n\u00b2)
Space: O(1)
Stable
"},{"location":"#counting-sort-complexity-stability-use-case_1","title":"Counting sort complexity, stability, use case","text":"Time complexity: O(n + k) // n is the number of elements, k is the range (the maximum element)
Space complexity: O(k)
Stable
Use case: known and small range of possible integers
"},{"location":"#counting-sort-algorithm","title":"Counting sort algorithm","text":"If range r is known
1) Create an array of size r where each a[i] represents the number of occurrences of i
2) Modify the array to store the cumulative sum (if a=[1, 3, 0, 2] => [1, 4, 4, 6])
3) Right shift the array with a backward iteration (element at index 0 is 0 => [0, 1, 4, 4]) Now a[i] represents the first index of i if array was sorted
4) Create the sorted array by filling the elements from their first index
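A minimal sketch of the four steps (assuming elements in the 0..r-1 range):
int[] countingSort(int[] a, int r) {\n // 1) Count the occurrences of each value\n int[] count = new int[r];\n for (int n : a) {\n count[n]++;\n }\n // 2) Cumulative sum\n for (int i = 1; i < r; i++) {\n count[i] += count[i - 1];\n }\n // 3) Right shift: count[i] is now the first index of i in the sorted array\n for (int i = r - 1; i > 0; i--) {\n count[i] = count[i - 1];\n }\n count[0] = 0;\n // 4) Fill the sorted array from the first indexes (iterating in order keeps it stable)\n int[] sorted = new int[a.length];\n for (int n : a) {\n sorted[count[n]++] = n;\n }\n return sorted;\n}\n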
"},{"location":"#heapsort-algorithm_1","title":"Heapsort algorithm","text":"Time: Theta(n log n)
Space: O(1)
Unstable
Use case: space constrained environment with O(n log n) time guarantee
Yet, not stable and not cache friendly
"},{"location":"#insertion-sort-algorithm","title":"Insertion sort algorithm","text":"From i to 0..n, insert a[i] to its correct position to the left (0..i)
Used by humans
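A minimal sketch; the sorted prefix grows one element at a time:
void insertionSort(int[] a) {\n for (int i = 1; i < a.length; i++) {\n int current = a[i];\n int j = i - 1;\n // Shift the bigger elements of the sorted prefix to the right\n while (j >= 0 && a[j] > current) {\n a[j + 1] = a[j];\n j--;\n }\n a[j + 1] = current;\n }\n}\n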
"},{"location":"#insertion-sort-complexity-stability-use-case_1","title":"Insertion sort complexity, stability, use case","text":"Time: O(n\u00b2)
Space: O(1)
Stable
Use case: partially sorted structure
"},{"location":"#mergesort-algorithm","title":"Mergesort algorithm","text":"Splits a collection into 2 halves, sort the 2 halves (recursive call) then merge them together to form one sorted collection
void mergeSort(int[] a) {\n int[] helper = new int[a.length];\n mergeSort(a, helper, 0, a.length - 1);\n}\n\nvoid mergeSort(int a[], int helper[], int lo, int hi) {\n if (lo < hi) {\n int mid = lo + ((hi - lo) / 2);\n\n mergeSort(a, helper, lo, mid);\n mergeSort(a, helper, mid + 1, hi);\n merge(a, helper, lo, mid, hi);\n }\n}\n\nprivate void merge(int[] a, int[] helper, int lo, int mid, int hi) {\n // Copy into helper\n for (int i = lo; i <= hi; i++) {\n helper[i] = a[i];\n }\n\n int p1 = lo; // Pointer on the first half\n int p2 = mid + 1; // Pointer on the second half\n int index = lo; // Index of a\n\n // Copy the smallest values from either the left or the right side back to the original array\n while (p1 <= mid && p2 <= hi) {\n if (helper[p1] <= helper[p2]) {\n a[index] = helper[p1];\n p1++;\n } else {\n a[index] = helper[p2];\n p2++;\n }\n index++;\n }\n\n // Copy the remaining elements of the left side (the rest of the right side is already in place)\n while (p1 <= mid) {\n a[index] = helper[p1];\n index++;\n p1++;\n }\n}\n
"},{"location":"#further-reading_2","title":"Further Reading","text":"Time: Theta(n log n)
Space: O(n)
Stable
Use case: good worst case time complexity and stable, good with linked list
"},{"location":"#quicksort-algorithm","title":"Quicksort algorithm","text":"Sort a collection by repeatedly choosing a pivot and partitioning the collection around it (smaller before, larger after)
Here the pivot will be the last element of the subarray
In an ideal world, the pivot would be the middle element so that we partition the array in two subsets of equal size
The worst case is when the chosen pivot is always the smallest or largest element of the subarray
void quickSort(int[] a) {\n quickSort(a, 0, a.length - 1);\n}\n\nvoid quickSort(int a[], int lo, int hi) {\n if (lo < hi) {\n int pivot = partition(a, lo, hi);\n quickSort(a, lo, pivot - 1);\n quickSort(a, pivot + 1, hi);\n }\n}\n\n// Returns an index so that all element before that index are smaller\n// And all element after are bigger\nint partition(int a[], int lo, int hi) {\n int pivot = a[hi];\n int pivotIndex = lo; // Will represent the pivot index\n\n // Iterate using the two pointers technique\n for (int i = lo; i < hi; i++) {\n // If the current index is smaller, swap and increment pivot index\n if (a[i] <= pivot) {\n swap(a, pivotIndex++, i);\n }\n }\n\n swap(a, pivotIndex, hi);\n return pivotIndex;\n}\n
"},{"location":"#quicksort-complexity-stability-use-case_1","title":"Quicksort complexity, stability, use case","text":"Time: best and average O(n log n), worst O(n\u00b2) if the array is already sorted in ascending or descending order
Space: O(log n) // In-place sorting algorithm
Not stable
Use case: in practice, quicksort is often faster than merge sort due to better locality (not applicable with linked list so in this case we prefer mergesort)
"},{"location":"#radix-sort-algorithm","title":"Radix sort algorithm","text":"Sort by applying counting sort on one digit at a time (least to most significant) Each new level must be stable (if equals, keep the order of the previous level)
Example:
Time complexity: O(nk) // n is the number of elements, k is the maximum number of digits for a number
Space complexity: O(k)
Stable
Use case: if k < log(n) (for example 1M of elements from 0..1000 as 4 < log(1M))
"},{"location":"#selection-sort-algorithm","title":"Selection sort algorithm","text":"From i to 0..n, find repeatedly the min element then swap it with i
"},{"location":"#selection-sort-complexity_1","title":"Selection sort complexity","text":"Time: Theta(n\u00b2)
Space: O(1)
"},{"location":"#shuffling-an-array","title":"Shuffling an array","text":"Fisher-Yates shuffle algorithm: - Iterate over each element (i) - Pick a random index (from 0 to i included) and swap with the current element
"},{"location":"#stack","title":"Stack","text":""},{"location":"#stack_1","title":"Stack","text":"LIFO (Last In First Out)
"},{"location":"#stack-implementations-and-insertdelete-complexity_1","title":"Stack implementations and insert/delete complexity","text":"Insert: O(1)
Delete: O(1)
Insert: O(n), amortized time O(1)
Delete: O(1)
"},{"location":"#string","title":"String","text":""},{"location":"#first-check-to-test-if-two-strings-are-a-permutation-or-a-rotation-of-each-other","title":"First check to test if two strings are a permutation or a rotation of each other","text":"Same length
"},{"location":"#how-to-print-all-the-possible-permutations-of-a-string","title":"How to print all the possible permutations of a string","text":"Recursion with backtracking
void permute(String s) {\n permute(s, 0);\n}\n\nvoid permute(String s, int index) {\n if (index == s.length() - 1) {\n System.out.println(s);\n return;\n }\n\n for (int i = index; i < s.length(); i++) {\n s = swap(s, index, i);\n permute(s, index + 1);\n s = swap(s, index, i);\n }\n}\n
"},{"location":"#rabin-karp-substring-search","title":"Rabin-Karp substring search","text":"Searching a substring s in a string b takes O(s(b-s)) time
Trick: compute the hash of each substring s
Sliding window of size s
Time complexity: O(b)
If hash matches, check if the string are equals (as two different strings can have the same hash)
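A minimal rolling hash sketch (the base and modulus values are arbitrary choices):
int indexOf(String s, String b) {\n int sLen = s.length(), bLen = b.length();\n if (sLen > bLen) {\n return -1;\n }\n final int base = 256;\n final int mod = 1_000_000_007;\n // base^(sLen - 1) % mod, used to remove the leading character\n long pow = 1;\n for (int i = 0; i < sLen - 1; i++) {\n pow = (pow * base) % mod;\n }\n long target = 0, hash = 0;\n for (int i = 0; i < sLen; i++) {\n target = (target * base + s.charAt(i)) % mod;\n hash = (hash * base + b.charAt(i)) % mod;\n }\n for (int i = 0; ; i++) {\n // If the hash matches, check if the strings are equal (possible collision)\n if (hash == target && b.regionMatches(i, s, 0, sLen)) {\n return i;\n }\n if (i + sLen == bLen) {\n return -1;\n }\n // Slide the window: remove b[i], add b[i + sLen]\n hash = (hash - b.charAt(i) * pow % mod + mod) % mod;\n hash = (hash * base + b.charAt(i + sLen)) % mod;\n }\n}\n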
"},{"location":"#string-permutation-vs-rotation","title":"String permutation vs rotation","text":"Permutation: contains the same characters in an order that can be different (abdc and dabc)
Rotation: rotates according to a pivot
"},{"location":"#string-questions-prerequisite","title":"String questions prerequisite","text":"Case sensitive?
Encoding?
"},{"location":"#technique","title":"Technique","text":"14 Patterns to Ace Any Coding Interview Question by Fahim ul Haq
"},{"location":"#01-knapsack-brute-force-technique","title":"0/1 Knapsack brute force technique","text":"Recursive approach: solve f(c, i) with c is the remaining capacity and i is th current item index At each level, we branch with the item at index i (if enough capacity) and without it
public int knapsack(int[] profits, int[] weights, int c) {\n return knapsack(profits, weights, c, 0, 0);\n}\n\npublic int knapsack(int[] profits, int[] weights, int c, int i, int sum) {\n if (i == profits.length || c <= 0) {\n return sum;\n }\n\n // Without the current item\n int sum1 = knapsack(profits, weights, c, i + 1, sum);\n\n // With the current item (if it fits)\n int sum2 = 0;\n if (weights[i] <= c) {\n sum2 = knapsack(profits, weights, c - weights[i], i + 1, sum + profits[i]);\n }\n\n return Math.max(sum1, sum2);\n}\n
"},{"location":"#01-knapsack-memoization-technique","title":"0/1 Knapsack memoization technique","text":"Memoization: store a[c][i] (c is the remaining capacity, i is the current item index)
As we need to store the 0 capacity, we have to init the array this way:
int[][] a = new int[c + 1][n] // n is the number of items
Time and space complexity: O(n * c)
public int knapsack(int[] profits, int[] weights, int capacity) {\n // Capacity from 0 to capacity included\n Integer[][] a = new Integer[capacity + 1][profits.length];\n return knapsack(profits, weights, capacity, 0, a);\n}\n\npublic int knapsack(int[] profits, int[] weights, int capacity, int i, Integer[][] a) {\n if (i == profits.length || capacity == 0) {\n return 0;\n }\n\n // If value already exists, return\n if (a[capacity][i] != null) {\n return a[capacity][i];\n }\n\n // Without the current item\n int sum1 = knapsack(profits, weights, capacity, i + 1, a);\n // With the current item (if it fits)\n int sum2 = 0;\n if (weights[i] <= capacity) {\n sum2 = profits[i] + knapsack(profits, weights, capacity - weights[i], i + 1, a);\n }\n\n a[capacity][i] = Math.max(sum1, sum2);\n return a[capacity][i];\n}\n
"},{"location":"#01-knapsack-tabulation-technique","title":"0/1 Knapsack tabulation technique","text":"Two dimensional array: a[n + 1][c + 1] // n the number of items and c the max capacity
First row and first column are set to 0
a[row][col] represent the max profit with items 1..row at capacity col
remainingWeight = col - itemWeight // col: current max capacity
a[row][col] = max(a[row - 1][col], itemValue + a[row - 1][remainingWeight]) // max between item not selected and item selected + max remaining weight
If remainingWeight < 0, we can't choose the item so a[row][col] = a[row - 1][col]
Return last element of the array
public int solveKnapsack(int[] profits, int[] weights, int capacity) {\n int[][] a = new int[profits.length + 1][capacity + 1];\n\n for (int row = 1; row < profits.length + 1; row++) {\n int value = profits[row - 1];\n int weight = weights[row - 1];\n for (int col = 1; col < capacity + 1; col++) {\n int remainingWeight = col - weight;\n if (remainingWeight < 0) {\n a[row][col] = a[row - 1][col];\n } else {\n a[row][col] = Math.max(\n a[row - 1][col],\n value + a[row - 1][remainingWeight]\n );\n }\n }\n }\n\n return a[profits.length][capacity];\n}\n
If we need to compute a result like \"determine if a subset exists\" that return a boolean, the array type is boolean[][]
As we are only interested in the previous row, we can also use an int[2][c + 1] array
"},{"location":"#backtracking-technique","title":"Backtracking technique","text":"Solution for solving a problem recursively
Loop: - apply() // Apply a change - try() // Try a solution - reverse() // Reverse apply
"},{"location":"#cyclic-sort-technique","title":"Cyclic sort technique","text":"Iterate over each number of an array and swap it to its correct position
At the end, we may iterate on the array to check which number is not at its correct position
If numbers are not within the 1 to n range, we can simply drop them
Alternative: marker technique (mark a result by setting a[i] to negative for example)
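A minimal find-the-missing-number sketch (assuming values in the 1 to n range, possibly with a duplicate replacing the missing number; uses the same swap helper as the other snippets):
int findMissingNumber(int[] a) {\n int i = 0;\n while (i < a.length) {\n int correct = a[i] - 1; // a[i] belongs at index a[i] - 1\n if (a[i] >= 1 && a[i] <= a.length && a[i] != a[correct]) {\n swap(a, i, correct);\n } else {\n i++;\n }\n }\n // Check which number is not at its correct position\n for (i = 0; i < a.length; i++) {\n if (a[i] != i + 1) {\n return i + 1;\n }\n }\n return a.length + 1;\n}\n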
"},{"location":"#greedy-technique_1","title":"Greedy technique","text":"Identify an optimal subproblem or substructure in the problem and determine how to reach it
Focus on what you have now (don't think about what comes next)
We may want to apply the traversal technique to have a global context for the identification part (a map of letters/positions etc.)
"},{"location":"#k-way-merge-technique","title":"K-way merge technique","text":"Given K sorted array, technique to perform a sorted traversal of all the elements of all arrays
We need to keep track of which structure the min element comes from (tracking the array index or taking the next node if it's a linked list)
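A minimal sketch over K sorted arrays; each heap entry tracks which array the value comes from:
List<Integer> mergeKSortedArrays(int[][] arrays) {\n // Entry: {value, array index, index in that array}\n PriorityQueue<int[]> minHeap = new PriorityQueue<>((x, y) -> x[0] - y[0]);\n for (int i = 0; i < arrays.length; i++) {\n if (arrays[i].length > 0) {\n minHeap.add(new int[] {arrays[i][0], i, 0});\n }\n }\n\n List<Integer> result = new ArrayList<>();\n while (!minHeap.isEmpty()) {\n int[] top = minHeap.poll();\n result.add(top[0]);\n // Push the next element of the same array\n int next = top[2] + 1;\n if (next < arrays[top[1]].length) {\n minHeap.add(new int[] {arrays[top[1]][next], top[1], next});\n }\n }\n return result;\n}\n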
"},{"location":"#runner-technique","title":"Runner technique","text":"Iterate over the linked list with two pointers simultaneously either with: - One ahead by a fixed amount - One faster
This technique can also be applied to other problems where we need to find a cycle (iterating slow = f(slow) and fast = f(f(fast)), the two pointers may converge)
"},{"location":"#simplification-technique","title":"Simplification technique","text":"Simplify the problem. If solvable, generalize to the initial problem.
Example: sort the array first
"},{"location":"#sliding-window-technique","title":"Sliding window technique","text":"Range of elements in a specific window size
Two pointers left and right: - Move right while condition is valid - Move left if condition is not valid
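A minimal sketch for one flavor (longest substring with at most k distinct characters): the right pointer extends the window, the left pointer shrinks it when the condition breaks:
int longestSubstringKDistinct(String s, int k) {\n Map<Character, Integer> frequencies = new HashMap<>();\n int left = 0, max = 0;\n for (int right = 0; right < s.length(); right++) {\n frequencies.merge(s.charAt(right), 1, Integer::sum);\n // Move left while the condition is not valid (more than k distinct characters)\n while (frequencies.size() > k) {\n char c = s.charAt(left++);\n if (frequencies.merge(c, -1, Integer::sum) == 0) {\n frequencies.remove(c);\n }\n }\n max = Math.max(max, right - left + 1);\n }\n return max;\n}\n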
"},{"location":"#subsets-technique","title":"Subsets technique","text":"Technique to find all the possible permutations or combinations
Start with an empty set, for each element of the input, add them to all the existing subsets to create new subsets
Example: - Given [1, 5, 3] - => [] // Start - => [], [1] - => [], [1], [5], [1,5] - => [], [1], [5], [1,5], [3], [1,3], [5,3], [1,5,3]
For each level, we iterate from 0 to size // size is the fixed size of the list
List<List<Integer>> findSubsets(int[] a) {\n List<List<Integer>> subsets = new ArrayList<>();\n // Add subset []\n subsets.add(new ArrayList<>());\n\n for (int n : a) {\n // Fix the current size\n int size = subsets.size();\n for (int i = 0; i < size; i++) {\n // Copy subset\n ArrayList<Integer> newSubset = new ArrayList<>(subsets.get(i));\n // Add element\n newSubset.add(n);\n subsets.add(newSubset);\n }\n }\n\n return subsets;\n}\n
"},{"location":"#technique-dealing-with-cycles-in-a-linked-list-or-an-array","title":"Technique - Dealing with cycles in a linked list or an array","text":"Runner technique
"},{"location":"#technique-find-all-the-permutations-or-combinations","title":"Technique - Find all the permutations or combinations","text":"Subsets technique or recursion + backtracking
"},{"location":"#technique-find-an-element-in-a-sorted-array-or-linked-list","title":"Technique - Find an element in a sorted array or linked list","text":"Binary search
"},{"location":"#technique-find-or-calculate-something-among-all-the-contiguous-subarrays-of-a-given-size","title":"Technique - Find or calculate something among all the contiguous subarrays of a given size","text":"Sliding window technique
Example: - Given an array, find the average of all subarrays of size \u2018K\u2019 in it
"},{"location":"#technique-find-the-longestshortest-substring-or-subarray","title":"Technique - Find the longest/shortest substring or subarray","text":"Sliding window technique
Example: - Longest substring with K distinct characters - Longest substring without repeating characters
"},{"location":"#technique-find-the-smallestlargestmedian-element-of-a-set","title":"Technique - Find the smallest/largest/median element of a set","text":"Two heaps technique
"},{"location":"#technique-finding-a-certain-element-in-a-linked-list-eg-middle","title":"Technique - Finding a certain element in a linked list (e.g. middle)","text":"Runner technique
"},{"location":"#technique-given-a-sorted-array-find-a-set-of-elements-that-fullfill-certain-conditions","title":"Technique - Given a sorted array, find a set of elements that fullfill certain conditions","text":"Two pointers technique
Example: - Given a sorted array and a target sum, find a pair in the array whose sum is equal to the given target - Given an array of unsorted numbers, find all unique triplets in it that add up to zero - Comparing strings containing backspaces
"},{"location":"#technique-given-an-array-of-size-n-containing-integer-from-1-to-n-eg-with-one-duplicate","title":"Technique - Given an array of size n containing integer from 1 to n (e.g. with one duplicate)","text":"Cyclic sort technique
"},{"location":"#technique-given-time-intervals","title":"Technique - Given time intervals","text":"Traversal technique
Iterate with two pointers, one over the starts, another one over the ends
Handle the element with the lowest value first and generate an event
Example: how many rooms for n meetings => meeting started, meeting started, meeting ended etc.
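A minimal sketch of the meeting rooms example (two pointers, one over the sorted starts, one over the sorted ends):
int minMeetingRooms(int[][] meetings) {\n int n = meetings.length;\n int[] starts = new int[n];\n int[] ends = new int[n];\n for (int i = 0; i < n; i++) {\n starts[i] = meetings[i][0];\n ends[i] = meetings[i][1];\n }\n Arrays.sort(starts);\n Arrays.sort(ends);\n\n int rooms = 0, max = 0;\n int s = 0, e = 0;\n // Handle the event with the lowest value first\n while (s < n) {\n if (starts[s] < ends[e]) {\n rooms++; // Meeting started\n s++;\n } else {\n rooms--; // Meeting ended\n e++;\n }\n max = Math.max(max, rooms);\n }\n return max;\n}\n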
"},{"location":"#technique-how-to-get-the-k-biggestsmallestfrequent-elements","title":"Technique - How to get the K biggest/smallest/frequent elements","text":"Top K elements technique
"},{"location":"#technique-optimization-problems-requiring-a-min-or-max_1","title":"Technique - Optimization problems requiring a min or max","text":"Greedy technique
"},{"location":"#technique-problems-featuring-a-list-of-sorted-arrays-merge-or-find-the-smallest-element","title":"Technique - Problems featuring a list of sorted arrays (merge or find the smallest element)","text":"K-way merge technique
"},{"location":"#technique-scheduling-problem-with-n-tasks-where-each-task-can-have-constraints-to-be-completed-before-others","title":"Technique - Scheduling problem with n tasks where each task can have constraints to be completed before others","text":"Topological sort technique
"},{"location":"#technique-situations-like-priority-queue-or-scheduling","title":"Technique - Situations like priority queue or scheduling","text":"Heap data structure
Possibly two heaps technique
"},{"location":"#top-k-elements-technique-biggest-and-smallest","title":"Top K elements technique (biggest and smallest)","text":"Finding the K biggest elements: - Min heap - Add k elements - Then iterate over the remaining elements, if current > min => remove min, add current
Finding the k smallest elements: - Max heap - Add k elements - Then iterate over the remaining elements, if current < max => remove max, add current
"},{"location":"#topological-sort-technique_1","title":"Topological sort technique","text":"If there is an edge from U to V, then U <= V
Possible only if the graph is a DAG
Algo: - Create a graph representation (adjacency list) and an in degree counter (Map) - Zero them for each vertex - Fill the adjacency list and the in degree counter for each edge - Add in a queue each vertex whose in degree count is 0 (source vertex with no parent) - While the queue is not empty, poll a vertex from it then decrement the in degree of its children (no removal)
To check if there is a cycle, we must compare the size of the produced array to the number of vertices
List<Integer> sort(int vertices, int[][] edges) {\n if (vertices == 0) {\n return Collections.EMPTY_LIST;\n }\n\n List<Integer> sorted = new ArrayList<>(vertices);\n // Adjacency list graph\n Map<Integer, List<Integer>> graph = new HashMap<>();\n // Count of incoming edges for each vertex\n Map<Integer, Integer> inDegree = new HashMap<>();\n\n for (int i = 0; i < vertices; i++) {\n inDegree.put(i, 0);\n graph.put(i, new LinkedList<>());\n }\n\n // Init graph and inDegree\n for (int[] edge : edges) {\n int parent = edge[0];\n int child = edge[1];\n\n graph.get(parent).add(child);\n inDegree.put(child, inDegree.get(child) + 1);\n }\n\n // Create a source queue and add each source (a vertex whose inDegree count is 0)\n Queue<Integer> sources = new LinkedList<>();\n for (Map.Entry<Integer, Integer> entry : inDegree.entrySet()) {\n if (entry.getValue() == 0) {\n sources.add(entry.getKey());\n }\n }\n\n while (!sources.isEmpty()) {\n int vertex = sources.poll();\n sorted.add(vertex);\n\n // For each vertex, we will decrease the inDegree count of its children\n List<Integer> children = graph.get(vertex);\n for (int child : children) {\n inDegree.put(child, inDegree.get(child) - 1);\n if (inDegree.get(child) == 0) {\n sources.add(child);\n }\n }\n }\n\n // Topological sort is not possible as the graph has a cycle\n if (sorted.size() != vertices) {\n return new ArrayList<>();\n }\n\n return sorted;\n}\n
"},{"location":"#traversal-technique","title":"Traversal technique","text":"Traverse the input and generate another data structure or optional events
Start the problem from this new state
"},{"location":"#two-heaps-technique_1","title":"Two heaps technique","text":"Keep two heaps: - A max heap for the first half - Then a min heap for the second half
May be required to balance them to have at most a difference in terms of size of 1
"},{"location":"#two-pointers-technique","title":"Two pointers technique","text":"Two pointers iterating through the data structure in tandem until one or both pointers hit a certain condition
Often useful when structure is sorted. If not sorted, we may want to sort it first.
Most of the time (not always): the first pointer is at the start, the second pointer at the end
The two pointers can also be on two different data structures, still iterating in tandem (e.g., comparing strings containing backspaces)
Time complexity is linear
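A minimal sketch for the pair-with-target-sum example (the array must be sorted):
int[] findPairWithTargetSum(int[] a, int target) {\n int left = 0, right = a.length - 1;\n while (left < right) {\n int sum = a[left] + a[right];\n if (sum == target) {\n return new int[] {left, right};\n }\n if (sum < target) {\n left++; // We need a bigger sum\n } else {\n right--; // We need a smaller sum\n }\n }\n return new int[] {-1, -1};\n}\n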
"},{"location":"#what-if-we-need-to-iterate-backwards-on-a-singly-linked-list-in-constant-space-without-mutating-the-input_1","title":"What if we need to iterate backwards on a singly linked list in constant space without mutating the input?","text":"Reverse the liked list (or a subpart only), implement the algo then reverse it again to the initial state
"},{"location":"#tree","title":"Tree","text":""},{"location":"#2-3-tree","title":"2-3 tree","text":"Self-balanced BST => O(log n) complexity
Either: - 2-node: contains a single value and has two children - 3-node: contains two values and has three children - Leaf: 1 or 2 keys
Insert: find the proper leaf and insert the value in-place. If the leaf has 3 values (called a temporary 4-node), split it into two 2-nodes and insert the middle value into the parent.
"},{"location":"#avl-tree","title":"AVL tree","text":"If tree is not balanced, rearange the nodes with single or double rotations
"},{"location":"#b-tree-complexity-access-insert-delete_1","title":"B-tree complexity: access, insert, delete","text":"All: O(log n)
"},{"location":"#b-tree-definition-and-use-case","title":"B-tree: definition and use case","text":"Self-balanced BST => O(log n) complexity
Can have more than two children (generalization of 2-3 tree)
Use-case: huge amount of data that cannot fit in main memory and must be stored on disk.
Height is kept low to reduce the disk accesses.
Matches how disk pages work
"},{"location":"#balanced-binary-tree-definition","title":"Balanced binary tree definition","text":"The balance factor of each node (the difference between the two subtree heights) should never exceed 1
Guarantee of O(log n) search
"},{"location":"#balanced-bst-use-case-b-tree-red-black-tree-avl-tree","title":"Balanced BST use case: B-tree, Red-black tree, AVL tree","text":"BFS: time O(v), space O(v)
DFS: time O(v), space O(h) (height of the tree)
"},{"location":"#binary-tree-bfs-traversal","title":"Binary tree BFS traversal","text":"Level order traversal (level by level)
Iterative algorithm: use a queue, put the root, iterate while queue is not empty
Queue<Node> queue = new LinkedList<>();\nqueue.add(root);\n\nwhile(!queue.isEmpty()) {\n Node node = queue.poll();\n visit(node);\n\n if(node.left != null) {\n queue.add(node.left);\n }\n if(node.right != null) {\n queue.add(node.right);\n }\n}\n
"},{"location":"#binary-tree-definition","title":"Binary tree definition","text":"Tree with each node having up to two children
"},{"location":"#binary-tree-dfs-traversal-in-order-pre-order-and-post-order","title":"Binary tree DFS traversal: in-order, pre-order and post-order","text":"It's depth first so:
"},{"location":"#binary-tree-complete","title":"Binary tree: complete","text":"Every level of the tree is fully filled, with the last level filled from the left to the right
"},{"location":"#binary-tree-full","title":"Binary tree: full","text":"Each node has 0 or 2 children
"},{"location":"#binary-tree-perfect","title":"Binary tree: perfect","text":"2^l - 1 nodes with l the level: 1, 3, 7, etc. nodes
Every level is fully filled
"},{"location":"#bst-complexity-access-insert-delete","title":"BST complexity: access, insert, delete","text":"If not balanced O(n)
If balanced O(log n)
"},{"location":"#bst-definition","title":"BST definition","text":"Binary tree in which every node must fit the property: all left descendents <= n < all right descendents
Implementation: optional key, value, left, right
"},{"location":"#bst-delete-algo-and-complexity_1","title":"BST delete algo and complexity","text":"Find inorder successor and swap it
Average: O(log n)
Worst: O(h) if not self-balanced BST, otherwise O(log n)
"},{"location":"#bst-insert-algo","title":"BST insert algo","text":"Search for key or value (by recursively going left or right depending on the comparison) then insert a new node or reset the value (no swap)
Complexity: worst O(n)
public TreeNode insert(TreeNode root, int a) {\n if (root == null) {\n return new TreeNode(a);\n }\n\n if (a <= root.val) { // Left\n root.left = insert(root.left, a);\n } else { // Right\n root.right = insert(root.right, a);\n }\n\n return root;\n}\n
"},{"location":"#bst-questions-prerequisite","title":"BST questions prerequisite","text":"Is it a self-balanced BST? (impacts: O(log n) time complexity guarantee)
"},{"location":"#complexity-to-create-a-trie_1","title":"Complexity to create a trie","text":"Time and space: O(n * l) with n the number of words and l the longest word length
"},{"location":"#complexity-to-insert-a-key-in-a-trie_1","title":"Complexity to insert a key in a trie","text":"Time: O(k) with k the size of the key
Space: O(1) iterative, O(k) recursive
"},{"location":"#complexity-to-search-for-a-key-in-a-trie_1","title":"Complexity to search for a key in a trie","text":"Time: O(k) with k the size of the key
Space: O(1) iterative or O(k) recursive
"},{"location":"#given-a-binary-tree-algorithm-to-populate-an-array-to-represent-its-level-by-level-traversal","title":"Given a binary tree, algorithm to populate an array to represent its level-by-level traversal","text":"Solution: BFS by popping only a fixed number of elements (queue.size)
public static List<List<Integer>> traverse(TreeNode root) {\n List<List<Integer>> result = new LinkedList<>();\n if (root == null) {\n return result;\n }\n\n Queue<TreeNode> queue = new LinkedList<>();\n queue.add(root);\n while (!queue.isEmpty()) {\n List<Integer> level = new ArrayList<>();\n\n int levelSize = queue.size();\n // Pop only levelSize elements\n for (int i = 0; i < levelSize; i++) {\n TreeNode current = queue.poll();\n level.add(current.val);\n if (current.left != null) {\n queue.add(current.left);\n }\n if (current.right != null) {\n queue.add(current.right);\n }\n }\n result.add(level);\n }\n return result;\n}\n
"},{"location":"#how-to-calculate-the-path-number-of-a-node-while-traversing-using-dfs","title":"How to calculate the path number of a node while traversing using DFS?","text":"Example: 1 -> 7 -> 3 gives 173
Solution: sum = sum * 10 + n
private int dfs(TreeNode node, int sum) {\n if (node == null) {\n return 0;\n }\n\n sum = 10 * sum + node.val;\n\n // Do something\n}\n
"},{"location":"#min-or-max-value-in-a-bst","title":"Min (or max) value in a BST","text":"Move recursively on the left (on the right)
"},{"location":"#red-black-tree","title":"Red-Black tree","text":"Self-balanced BST => O(log n) complexity
Binary Trees: Red Black by David Pynes
"},{"location":"#red-black-tree-complexity-access-insert-delete_1","title":"Red-black tree complexity: access, insert, delete","text":"All: O(log n)
"},{"location":"#reverse-a-binary-tree-algo","title":"Reverse a binary tree algo","text":"public void reverse(Node node) {\n if (node == null) {\n return;\n }\n\n Node temp = node.right;\n node.right = node.left;\n node.left = temp;\n\n reverse(node.left);\n reverse(node.right);\n}\n
"},{"location":"#trie-definition-implementation-and-use-case","title":"Trie definition, implementation and use case","text":"Tree-like data structure with empty root and where each node store characters
Each path down the tree represents a word (until a null node that represents the end of the word)
Usually implemented using a map of children (or a fixed size array with ASCII charset for example)
Use case: dictionary (save memory)
Also known as prefix tree
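A minimal sketch using a map of children:
class Trie {\n private static class Node {\n Map<Character, Node> children = new HashMap<>();\n boolean endOfWord; // Marks the end of a word\n }\n\n private final Node root = new Node(); // Empty root\n\n void insert(String word) {\n Node node = root;\n for (char c : word.toCharArray()) {\n node = node.children.computeIfAbsent(c, k -> new Node());\n }\n node.endOfWord = true;\n }\n\n boolean search(String word) {\n Node node = root;\n for (char c : word.toCharArray()) {\n node = node.children.get(c);\n if (node == null) {\n return false;\n }\n }\n return node.endOfWord;\n }\n}\n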
"},{"location":"#why-to-use-bst-over-hash-table","title":"Why to use BST over hash table","text":"Sorted keys
"},{"location":"anki/","title":"Anki","text":"Anki is a free software (Windows/Mac/Linux/iPhone/Android) designed to help remembering information. Anki relies on the concept of spaced repetition which is a proven technique to increase the rate of memorization. Here's a 2-minute video that delves into spaced repetition:
Michael A. Nielsen, \"Augmenting Long-term Memory\"
The single biggest change that Anki brings about is that it means memory is no longer a haphazard event, to be left to chance. Rather, it guarantees I will remember something, with minimal effort. That is, Anki makes memory a choice.
I used Anki myself with Algo Deck and Design Deck and it paid off. This method played a key role in helping me land a role as L5 SWE at Google (senior software engineer).
Here is a flashcard example:
The Anki versions (a clone of the flashcards from this repo) are available via one-time GitHub sponsorships:
Trusted by over 100 developers.
"},{"location":"designdeck/","title":"Design Deck","text":"AnkiCheck the Anki version here.
"},{"location":"designdeck/#cache","title":"Cache","text":""},{"location":"designdeck/#cache-aside","title":"Cache aside","text":"Application is responsible for reading and writing to the DB (using write-through or write-back policy)
The cache doesn't interact with the storage directly
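A minimal read-path sketch (cache and db are hypothetical interfaces; writes go through the application with a write-through or write-back policy):
V get(K key) {\n V value = cache.get(key);\n if (value == null) {\n // Cache miss: the application reads from the DB and populates the cache\n value = db.read(key);\n cache.put(key, value);\n }\n return value;\n}\n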
"},{"location":"designdeck/#cache-aside-vs-read-through","title":"Cache aside vs. read-through","text":"Cache aside: - Data model can be different from DB
Read-through: - Same data model as DB - Can use the refresh-ahead pattern
"},{"location":"designdeck/#cache-eviction-policy","title":"Cache eviction policy","text":"Cache to automatically refresh any recently accessed entry prior to its expiration
Used with read-through cache
Main difference: consistency
Write through: 1. Write to the cache and the DB in a single DB transaction (may still lead to cache inconsistency if the DB commit failed) 2. Return
Write back: 1. Write to the cache 2. Return 3. Asynchronously store in DB
"},{"location":"designdeck/#four-main-distributed-cache-benefits","title":"Four main distributed cache benefits","text":"Cache hit ratio: hits / total accesses
"},{"location":"designdeck/#read-through-cache","title":"Read-through cache","text":"Read-through cache sits in-line with the DB
Single entry point
"},{"location":"designdeck/#when-to-use-a-cache","title":"When to use a cache","text":"Content Delivery Network
Network of geographically dispersed servers used to deliver static content (images, CSS, Javascript files, etc.)
Two kinds of CDNs: - Push CDN: we are responsible for providing the content - Pull CDN: CDN is responsible for pulling the right content (expiration to be used)
Pull is easier to handle whereas push gives us more flexibility
Use case for pull: Docker Hub S3 layer
"},{"location":"designdeck/#db","title":"DB","text":""},{"location":"designdeck/#3-main-reasons-to-partition-data","title":"3 main reasons to partition data","text":"Atomic: all transaction succeeds or none does (all or nothing)
Consistency: from one valid state to another (invariants must always be true)
Not necessarily a property of the DB (e.g., foreign key constraint), can be a property of the application (e.g., credits and debits must be balanced)
Different from consistency in eventual consistency (which is more about convergence as the matter is replicating data)
Refers to serializability
Optimization to favor latency over consistency when writing to a DB (e.g., leaderless replication)
Background process to constantly looks for differences in data
Could be used as an alternative or in conjunction with read repair
"},{"location":"designdeck/#byzantine-fault-tolerant","title":"Byzantine fault-tolerant","text":"A system is Byzantine fault-tolerant if it continues to operate correctly if in the case of a Bizantine's problem (some of the nodes malfunctioning, not obeying the protocol or malicious attackers).
"},{"location":"designdeck/#calm-theorem","title":"CALM theorem","text":"Consistency As Logical Monotonicity
A program has a consistent, coordination-free (e.g., consensus-free) distributed implementation if and only if it is monotonic
Consistency in this context doesn't mean linearizability. It focuses on the consistency of the program's output while traditional consistency focus on the consistency of reads and writes.
In CALM, a consistent program is one that produces the same output no matter in which order the inputs are processed and despite any conflicts.
Said differently, does the implementation produce the outcome we expect despite any race condition that may arise.
"},{"location":"designdeck/#cap-theorem","title":"CAP theorem","text":"Consistency, availability, partition tolerance (e.g., one node cut off from the rest of the cluster because of a network partition) => pick 2 out of 3
C refers to linearizability
"},{"location":"designdeck/#caveat-of-serializability","title":"Caveat of serializability","text":"It's possible that serial order is different from the order in which transactions were actually run (latest may not win)
If not, we need a stricter isolation level: strict serializability (serializability + linearizability)
"},{"location":"designdeck/#chain-replication","title":"Chain replication","text":"Replication protocol that uses a different topology than leader based replication protocols like Raft
Left-most process referred as the chain's head, right-most as the chain's tail: - Client send writes to the head, which updates its local state and forwards to the next process in the chain - Next process updates its local state and forwards to the next process in the chain - Etc. - Once the update is received by the tail, the ack flows back to the head which replies to the client that the write succeeded
Fault tolerance is delegated to a dedicated component: control plane - If head fails: the control plane removes it and makes the next as the head - If intermediate node fails: the control plane removes it temporarily from the chain, and then adds it back eventually as the tail - If tail fails: the control plane removes it and makes the predecessor as the new tail
Benefits: - Strongly consistent protocol - Reads are served from the tail without contacting other replicas first which allows a lower response time
Drawbacks: - Writes are slower than quorum-based replication. - A single slow node can slow down all writes. - As reads are served from a single node, it can't be scaled horizontally. A mitigation is to allow intermediate nodes to serve reads but they can do it only if a read is considered as clean (the ack for this object has been returned to the predecessor). // The tail serves as the authority of the latest clean version
Notes: - To avoid the overhead of having a single node handling the writes, we can find a way to shard data and handle multiple chains (see https://engineering.fb.com/2022/05/04/data-infrastructure/delta/)
"},{"location":"designdeck/#chain-replication-vs-consensus","title":"Chain replication vs. consensus","text":"Similar consistency guarantees
Chain replication: - Optimized for reads for CP systems - Better read availability: a chain of n nodes can tolerate up to n-2 nodes failure
Example with 5 nodes: - Chain replication: tolerate up to 3 nodes failure - Consensus with R=3 and W=3: tolerate up to 2 nodes failure
Consensus: - Optimized for writes for CP systems
"},{"location":"designdeck/#change-data-capture-cdc","title":"Change data capture (CDC)","text":"A datastore is selected as the authoritative source of data where all update operations are performed
An event log is then created from this datastore that is consumed by all the remaining operations the same way as in event sourcing
"},{"location":"designdeck/#concurrency-control","title":"Concurrency control","text":"Ensures that correct results for concurrent operations are generated
Pessimistic: lock (mutual exclusion)
Optimistic: checks for conflicts at the end of a transaction
In the end, concurrency control serves the same purpose as atomicity
"},{"location":"designdeck/#consensus","title":"Consensus","text":"Set of processes agreeing on some data value in a fault-tolerant way
Satisfies safety and liveness
"},{"location":"designdeck/#consistency-models","title":"Consistency models","text":"Describe what expectations clients might have in terms of possible returned values despite the existence of multiple copies of data and concurrent access to it
Not the C in ACID but the C in CAP (converging to an end state)
Eventual consistency: all the nodes converge to the same state (not necessarily the latest)
Write follow reads: ensures that writes are ordered after writes that were observed by previous read operations
Example: - P1 reads value => foo - P1 updates value to bar => Every node will converge to bar (a process can't read bar, then foo, regardless of the process) Also known as session causality
Monotonic reads consistency: a client doing several reads in sequence will never go backward in time
Monotonic writes consistency: values originating from the same client appear in the order the client has executed them
Read-after-write-consistency: if a client performs a write, then this write if visible during subsequent reads
Also known as read-your-writes consistency
Causal consistency: operations that are causally related need to be seen in the same order by all the nodes
Sequential consistency: operations appear to take place in some total order, and that order is consistent with the order of operations from each individual clients
Twitter example: no guarantee between which tweet is seen first between two friends posting at the same time, but the ordering is guaranteed for the same friend
Even though there may be multiple replicas, the application does not need to worry about them
C in CAP
Real time guarantees
"},{"location":"designdeck/#cqrs","title":"CQRS","text":"Command Query Responsibility Segregation
Dissociate writes (command) from reads (query)
Pros: - Allows creating stores per use case (e.g., analytics, geospatial) - Scale the read and write parts independently
Cons: - Eventual consistency between the stores
"},{"location":"designdeck/#crdt","title":"CRDT","text":"Conflict-free Replicated Data Types
Data structure that is replicated across nodes: - Replicas are updated independently, concurrently and without coordination - An algo (part of the data type) can perform a deterministic conflict resolution - Replicas are guaranteed to eventually converge to the same state => Strong eventual consistency
Used in the context of collaborative applications
Note: CRDTs can be combined to form new CRDTs
"},{"location":"designdeck/#crdt-and-collaborative-applications-eg-google-docs","title":"CRDT and collaborative applications (e.g., Google Docs)","text":"Compared to OT, each character has a stable identifier (even if characters are added or deleted)
Example: 0 is the beginning of the document, 1 is the end of the document, every character has a fractional number as an ID
May lead to interleaving problems (e.g;, two inserted words by two users are interleaved: \"Alice\", \"Bob\" => \"BoAlibce\"
Interleaving depends on the merging algorithm used (e.g., Treedoc doesn't lead to interleaving)
"},{"location":"designdeck/#db-indexes-tradeoff","title":"DB indexes tradeoff","text":"Speed up read query but slow down writes
"},{"location":"designdeck/#db-internal-components","title":"DB internal components","text":"Optimized state-based CRDTs where only recently applied changes to a state are replicated instead of the full state
"},{"location":"designdeck/#denormalization","title":"Denormalization","text":"Introduce some amount of duplication in a normalized dataset in order to speed up reads (e.g., denormalized document, cache or index)
Cons: - Requires more space - May slow down writes
"},{"location":"designdeck/#design-consideration-when-partitioning-data","title":"Design consideration when partitioning data","text":"Should match the primary access pattern
"},{"location":"designdeck/#downside-of-distributed-transactions","title":"Downside of distributed transactions","text":"Performance penalty
Example: distributed transactions in MySQL are reported to be over 10 times slower than single-node transactions
"},{"location":"designdeck/#event-sourcing","title":"Event sourcing","text":"Ensures that all changes to application state are stored as a sequence of events
"},{"location":"designdeck/#eventual-consistency-requirements","title":"Eventual consistency requirements","text":"Splits up DB by function
"},{"location":"designdeck/#fencing-token","title":"Fencing token","text":"Monotonically increasing token that increments whenever a client acquires a distributed lock
Use case: when writing to a DB, if the provided token has a lower value than the current one, rejects the write
Solve possible issues with lease as an update has to be made from the latest token
"},{"location":"designdeck/#gossip-protocol","title":"Gossip protocol","text":"Peer-to-peer protocol based on the way epidemics spread
No central registry and the only way to spread common data is to rely on each member to pass it along to their neighbors
Useful when broadcasting to a large number of processes like thousands or more, where a deterministic protocol wouldn't scale
"},{"location":"designdeck/#graph-db-main-use-case","title":"Graph DB main use case","text":"Relational can handle simple cases of many-to-many relationships
Yet, if the connections become more complex, it's more natural to start modeling data as a graph
"},{"location":"designdeck/#hinted-handoff","title":"Hinted handoff","text":"Optimization to favor latency over consistency when writing to a DB
If a coordinator node cannot contact the necessary number of replicas, it stores locally the result of the operation and forward it to the failed node(s) after they recovered
Used in sloppy quorums
"},{"location":"designdeck/#hot-spot-in-partitioning","title":"Hot spot in partitioning","text":"Partition is heavily loaded compared to others
Also called skew
"},{"location":"designdeck/#in-a-database-strategy-to-handle-rebalancing","title":"In a database, strategy to handle rebalancing","text":"Not based on key hashing as a rebalancing would be huge
Simple solution: Create many more partitions than nodes and assign several partitions to each node (e.g., a db running on a cluster of 10 nodes may be split into 10k partitions). When a node is added to the cluster, it will steal a few partitions from every existing node
"},{"location":"designdeck/#isolation-levels","title":"Isolation levels","text":"Degree to which transactions are isolated from other concurrent execution transactions
Isolations come at a performance cost (more coordination and synchronization)
Dirty writes (overwriting data another in-flight transaction has written) => can violate integrity constraints - Dirty reads (seeing uncommitted writes) => decisions can be taken based on data updates that can be rolled back
Fuzzy reads: a transaction reads a value twice but sees a different value in each read because a committed transaction updated the value between the two reads
Lost updates: two transactions read the same value and then try to update it to two different values; only one update survives
Example: two transactions read the current inventory size (say 100 items), add respectively 5 and 10 items, and then store back the new size. Depending on the execution order, the final size can be 110 instead of 115.
Read skew: an integrity constraint seems to be violated because a transaction can only see partial results of another transaction
Write skew: when two transactions read the same objects and then update some of those objects
Example: Two on-call doctors for a shift. Both feeling unwell, and decide to request leave. They both click the button at the same time. In the case of a write skew, the two transactions can succeed as for both, when reading the number of available doctors, it was more than one.
Example: Transaction A computes the max and average age of employees. Transaction B is interleaved and inserts a lot of old employees. Thus, the average age could be larger than the max.
"},{"location":"designdeck/#known-crdts","title":"Known CRDTs","text":"Counter: - Grow-only counter: increment only - Positive-negative counter: increment and decrement (combination of two grow only counter: one positive, one negative)
Register (a memory cell storing whatever): - LWW-register: total order using timestamps - Multi-value register: keep track of causality, in case of conflicts it returns all conflicting cases (analogy: Git with an interactive merge resolution)
Set: - Grow-only set: once an element is added it can't be removed - Two-phase set: elements can be added and removed (combination of two grow only set) - LWW-element set (last-write-wins): similar to two-phase set but we associate a timestamp for each element to resolve conflicts - Observed-remove set: use tags instead of timestamps; each element is associated to a list of add-tags and a list of remove-tags (example: vector clocks) - Sequence: used to build collaborative applications (e.g., Treedoc)
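Illustrative Java sketch of the grow-only counter above (assumes a fixed, known set of replicas):
class GCounter {\n private final long[] counts; // one slot per replica\n private final int replicaId;\n\n GCounter(int nReplicas, int replicaId) {\n this.counts = new long[nReplicas];\n this.replicaId = replicaId;\n }\n\n void increment() { counts[replicaId]++; } // a replica only increments its own slot\n\n long value() { long sum = 0; for (long c : counts) sum += c; return sum; }\n\n // Merge is commutative, associative, and idempotent (element-wise max)\n void merge(GCounter other) {\n for (int i = 0; i < counts.length; i++) {\n counts[i] = Math.max(counts[i], other.counts[i]);\n }\n }\n}\n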
"},{"location":"designdeck/#last-write-wins-lww","title":"Last-write-wins (LWW)","text":"Conflict resolution based on timestamp
Used by DynamoDB or Cassandra to resolve conflicts
Conflicts shouldn't happen in single-master replication
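Minimal LWW-register merge in Java (sketch; the tie-breaking rule is an assumption):
class LwwRegister<T> {\n T value;\n long timestamp;\n\n // Keep the value carrying the newest timestamp\n void merge(LwwRegister<T> other) {\n if (other.timestamp > timestamp) {\n value = other.value;\n timestamp = other.timestamp;\n }\n // On equal timestamps, a deterministic tie-breaker (e.g., replica ID) would be needed\n }\n}\n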
"},{"location":"designdeck/#leader-election","title":"Leader election","text":"Algorithm to guarantee at most one leader at any given time (safety) and that an election eventually completes (liveness)
"},{"location":"designdeck/#lsm-tree","title":"LSM tree","text":"Log-Structured Merge tree
Consists of smaller mutable memory-resident (memtable) and larger immutable disk-resident (SSTable) components
Memtable data is sorted and flushed to disk when its size reaches a configurable threshold or periodically
Because a memtable is just a special case of a buffer, durability is not guaranteed (durability must be brought by replication)
Examples: Lucene, Cassandra, Bitcask, etc.
"},{"location":"designdeck/#lsm-tree-vs-b-tree","title":"LSM tree vs. B-tree","text":"LSM-tree faster for writes, slower for reads because it has to check multiple data structures (bigger read amplification): memtable and SSTable
Compaction can impact ongoing requests
B-tree faster for reads, slower for writes as it must write every piece of data at least twice in the WAL & tree itself (bigger write amplification)
Each key exists in exactly one place => easier to offer strong transactional semantics
"},{"location":"designdeck/#main-difference-between-consistency-models-and-isolation-levels","title":"Main difference between consistency models and isolation levels","text":"Consistency models: applies to single-object operations
Isolation levels: applies to multi-object operations
"},{"location":"designdeck/#merkle-tree","title":"Merkle tree","text":"A tree in which every leaf is labelled with the hash of a data block: - Level n contains the data blocks - Level n-1 the hash of one data block - Level n-2 the hash of 2 data blocks - Level 1 the hash of all the data blocks
Efficient and secure verification of the contents of a large data structure
Allows reducing the data transferred between a client and a server. For example, if we want to compare a Merkle tree stored on a server with one stored on the client, they can both exchange their top hash. If different, we can delve in and only fetch the data blocks that have changed.
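Illustrative Java sketch of computing the top hash bottom-up (assumes SHA-256, a non-empty list of leaf hashes, and duplicating the last node on odd levels):
import java.security.MessageDigest;\nimport java.util.ArrayList;\nimport java.util.List;\n\nstatic byte[] merkleRoot(List<byte[]> blockHashes) throws Exception {\n MessageDigest sha = MessageDigest.getInstance(\"SHA-256\");\n List<byte[]> level = blockHashes;\n while (level.size() > 1) {\n List<byte[]> parent = new ArrayList<>();\n for (int i = 0; i < level.size(); i += 2) {\n sha.update(level.get(i));\n sha.update(level.get(Math.min(i + 1, level.size() - 1))); // duplicate last node if odd\n parent.add(sha.digest()); // digest() also resets the instance\n }\n level = parent;\n }\n return level.get(0); // top hash\n}\n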
"},{"location":"designdeck/#monotonic-reads-consistency-implementation","title":"Monotonic reads consistency implementation","text":"One way to achieve it is to make sure each user always makes their reads from the same replica
"},{"location":"designdeck/#mvcc","title":"MVCC","text":"Multiversion Concurrency Control
A possible implementation of optimistic concurrency control and snapshot isolation level
MVCC allows reads and writes to proceed with minimal coordination on the storage level since reads can continue accessing older values until the new ones are committed
"},{"location":"designdeck/#n1-select-problem","title":"N+1 select problem","text":"Assuming a one-to-many relationship between 2 tables A and B => A 1-* B
If we want to iterate through all the A and for each one, print the list of B, the naive implementation would be: - select * from A
- And then for each A, select * from B where A_ID = ?
Alternatively, we could reduce the number of round-trips to the DB from N+1 to 2 with a simple select * from B
Most ORM tools prevent N+1 selects
"},{"location":"designdeck/#nosql-main-types-and-main-architecture-principles","title":"NoSQL: main types and main architecture principles","text":"Key-value store, document store, column-oriented store or graph DB
"},{"location":"designdeck/#operation-based-crdts-definition-and-requirements","title":"Operation-based CRDTs: definition and requirements","text":"Commutative replicated data types
Replication is done by propagating the update operation
Operations characteristics: - Must be commutative. - Not necessarily idempotent. If idempotent, OK. If not, it's up to the delivery layer to ensure the operations are delivered without duplication. - Delivered in causal order.
"},{"location":"designdeck/#operational-transformation-ot-concept-and-main-drawback","title":"Operational transformation (OT): concept and main drawback","text":"A way to handle collaborative applications
Receive update operations and depending on the operations that occur concurrently, transform them
Example: - Initial state: \"helo\" - Concurrently: user 1 inserts \"l\" at position 3 and user 2 inserts \"!\" at position 4 - If the transaction from user 1 completes before the one from user 2, we end up with \"hell!o\" instead of \"hello!\" - OT will transform the transaction from user 2 into: insert \"!\" at position 5
Drawback: all the communications go through a central server (e.g., impossible with systems at scale such as Google Docs)
Replaced with CRDT
"},{"location":"designdeck/#optimistic-concurrency-control-pros-and-cons","title":"Optimistic concurrency control: pros and cons","text":"Perform badly if high contention as it leads to a high proportion of retry, thus making performance worse
If not much contention, it tends to perform better than pessimistic
"},{"location":"designdeck/#pacelc-theorem","title":"PACELC theorem","text":"If case of a network partition (P): we should choose between availability (A) or consistency (C)
Else, in the absence of partition (E): we should choose between latency (L) or consistency (C)
Most systems are either: - AP/EL - CP/EC
"},{"location":"designdeck/#partitioning-sharding","title":"Partitioning (sharding)","text":"Split up a large dataset that is too big for a single machine into smaller parts and spread them across several machines
Define the partition type based on the primary access pattern
"},{"location":"designdeck/#partitioning-criteria","title":"Partitioning criteria","text":"Range partitioning: keys are sorted and a partition owns all the keys from some minimum up to some maximum (example: MySQL RANGE COLUMNS partitioning) - Pros: efficient range queries - Cons: Risk of hot spots, requires repartitioning to potentially split a range into two subranges if a partition gets too big
Hash partitioning: hash function is applied to each key and a partition owns a range of hashes
"},{"location":"designdeck/#partitioning-methods","title":"Partitioning methods","text":"Horizontal partitioning: partition by rows
Vertical partitioning: partition by columns (create tables with fewer columns)
Rationale: if the subtables have different access patterns (e.g., if a column is a blob that we rarely consume, vertical partitioning can store this blob off the primary disk)
Also called normalization
"},{"location":"designdeck/#quorum","title":"Quorum","text":"Minimum number of nodes that need to vote on an operation before it can be considered successful
Usually: majority
"},{"location":"designdeck/#raft","title":"Raft","text":"Leader election and replication algorithms
"},{"location":"designdeck/#leader-election_1","title":"Leader election","text":"Using a state machine to elect a leader
Each process is in one of these three states: leader, candidate (part of the election process), follower
"},{"location":"designdeck/#replication","title":"Replication","text":"The leader stores the sequence of operations altering the state into a local ordered log
Then, this log is replicated across followers. Each entry is considered committed when it has been replicated on a majority of nodes
Replication enables consensus
"},{"location":"designdeck/#read-repair","title":"Read repair","text":"Optimization to favor latency over consistency when writing to a DB (e.g., leaderless replication)
If a coordinator node receives conflicting values from the contacted replicas (which shouldn't happen in case of single-master replication for example), it resolves the conflict by: - Resolving the conflict (e.g., LWW) - Forwarding it to the stale replica - Responding to the read request
"},{"location":"designdeck/#relation-between-replication-factor-write-consistency-and-read-consistency","title":"Relation between replication factor, write consistency and read consistency","text":"Given: - N: number of replicas - W: number of nodes that have to ack a write for it to succeed - R: number of nodes that have to respond to a read operation for it to succeed
If R+W > N, the system can guarantee to return the most recent written value because there's always an overlap between read and write sets (consistency)
Notes: - In case of read-heavy systems, we want to minimize R - If W = 1 and R = N, durability isn't guaranteed in the presence of failure - If W < (N+1)/2, it may lead to write conflicts (e.g., W < 2 with 3 nodes) - If R+W <= N, weak/eventual consistency
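Worked example: with N = 3, W = 2 and R = 2, R + W = 4 > 3, so the set of 2 nodes that acked a write and the set of 2 nodes contacted for a read always overlap in at least one node holding the latest value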
"},{"location":"designdeck/#replication-vs-partition-impacts","title":"Replication vs. partition: impacts","text":"Replication: - Read-heavy - Availability > consistency
Partition: - Write-heavy (splitting up data across different shards)
"},{"location":"designdeck/#schema-on-read-vs-schema-on-write","title":"Schema-on-read vs. schema-on-write","text":"Schema-on-read: implicit schema but not enforced by the DB (also called schemaless but misleading)
Schema-on-write: explicit schema, the DB ensures all writes are conforming to it (e.g., relational DB)
"},{"location":"designdeck/#serializability","title":"Serializability","text":"I in ACID (strong isolation level)
Equivalent to serial execution (no interleaving due to concurrent transactions)
"},{"location":"designdeck/#serializable-snapshot-isolation-ssi","title":"Serializable Snapshot Isolation (SSI)","text":"Snapshot Isolation (SI) allows write skew
SSI is a stricter isolation level than SI that prevents write skew: it checks at runtime for conflicts between transactions
Downside: increases the number of aborted transactions
"},{"location":"designdeck/#single-leader-multi-leader-leaderless-replication","title":"Single-leader, multi-leader, leaderless replication","text":""},{"location":"designdeck/#single-leader","title":"Single-leader","text":"All writes go through one leader
Pro: ensure consistency
Con: all writes go through a single node (bottleneck)
"},{"location":"designdeck/#multi-leader","title":"Multi-leader","text":"Rarely makes sense within a single datacenter (benefits rarely outweigh the added complexity) but used in multi-datacenter contexts
DB must resolve the conflicts in a convergent way
Use cases: - One leader per datacenter
Different topologies; most used: all-to-all
Pro: not limited to the write throughput of a single node
Con: possible write conflicts
"},{"location":"designdeck/#leaderless-replication","title":"Leaderless replication","text":"Client sends its writes to several replicas in parallel
Read requests are also sent in parallel to multiple replicas (this way, if a write hasn't been replicated yet to one replica, it won't lead to stale data)
Rely on read repair and anti-entropy mechanisms
Rely on quorum to know how long to wait for a request (not perfect: if a write fails because we didn't reach a quorum, what shall we do about the replicas where the write has already been committed?)
Examples: Cassandra, DynamoDB, Riak
Pro: throughput
Con: quorums are not perfect; they provide an illusion of strong consistency when in reality it's often not true
"},{"location":"designdeck/#sloppy-quorum","title":"Sloppy quorum","text":"In case of a quorum of w nodes to accept a write: if we can't reach w, the DB accepts the write replicate it to nodes that aren't among the ones on which the value usually lives
Relies on hinted handoff
"},{"location":"designdeck/#snapshot-isolation-si","title":"Snapshot Isolation (SI)","text":"Guarantee that all reads made in a transaction will see a consistent snapshot of the database
In practice, it reads the last committed values that existed at the time it started
Allows write skew
"},{"location":"designdeck/#snapshot-isolation-common-implementation","title":"Snapshot Isolation common implementation","text":"MVCC
"},{"location":"designdeck/#sstable","title":"SSTable","text":"Sorted String Table, immutable components of a LSM tree
Sorted immutable data structure
It consists of 2 components: index files and data files
The index (based on a hashtable or a B-tree) holds the keys and the data entries (offsets in the data file where the actual records are located)
Data files hold records in key order
"},{"location":"designdeck/#state-based-crdts-definition-and-requirements","title":"State-based CRDTs: definition and requirements","text":"Convergent replicated data types
Replication is done by propagating the full local state to replicas
States are merged with a function which must be: - Commutative - Idempotent - Associative => Updates monotonically increase the internal state according to some defined partial order rules (e.g., max of two values, union of two sets)
=> Delivery layer doesn't have to guarantee causal ordering nor idempotency, only eventual delivery
"},{"location":"designdeck/#strong-eventual-consistency-definition-and-requirements","title":"Strong eventual consistency: definition and requirements","text":"Stronger guarantee than eventual consistency
Based on the fact that we can define a deterministic outcome for any conflict
Requires: - Eventual delivery: every update applied to a replica is eventually applied to all replicas - Strong convergence: guarantees that replicas that have executed the same updates have the same state (with eventual consistency, the guarantee is that the replicas eventually reach the same state, once consensus is reached)
Strong convergence requires convergent replicated data types (part of CRDT family)
Main difference with eventual consistency: - Leaderless replication - No consensus needed, instead, it relies on a deterministic outcome for any conflict
A solution to the CAP theorem
"},{"location":"designdeck/#three-phase-commit-3pc","title":"Three-phase commit (3PC)","text":"Failure-resilient refinement of 2PC
Unlike 2PC, satisfies liveness but not safety
"},{"location":"designdeck/#transaction","title":"Transaction","text":"A unit of work performed in a database system, representing a change, which can be potentially composed of multiple operations
"},{"location":"designdeck/#two-main-approaches-to-partition-a-table-that-has-secondary-indexes","title":"Two main approaches to partition a table that has secondary indexes","text":"Partitioning secondary indexes by document: - Each partition maintains its own secondary index - Write: one partition - Query on the index: requires querying multiple partitions (scatter/gather)
Optimized for writes
Example: Elasticsearch, MongoDB, Cassandra, Riak, etc.
Partitioning secondary indexes by term: - Global index covering all the partitions (to be replicated) - Write: multiple partitions are updated (for resiliency) - Query on the index: served from one partition containing the index
Optimized for reads
"},{"location":"designdeck/#two-types-of-crdts","title":"Two types of CRDTs","text":"Operation-based and state-based
Operation-based CRDTs require less bandwidth
State-based CRDTs require fewer assumptions about the delivery layer
"},{"location":"designdeck/#two-phase-commit-2pc","title":"Two-phase commit (2PC)","text":"Protocol used to implement atomic transaction commits across multiple processes
Satisfies safety but not liveness
"},{"location":"designdeck/#wal","title":"WAL","text":"Write-ahead log (or redo log)
Append-only file to which every modification must be written
Used for restoration in the event of a DB crash: - Durability - Atomicity (allows identifying the operations in progress and completing or undoing them)
"},{"location":"designdeck/#when-relational-vs-when-document","title":"When relational vs. when document","text":"Relational (schema-on-write): - Better support for joins - Many-to-one and many-to-many relationships - ACID
Document (schema-on-read): - Schema flexibility - Better performance due to locality - Closer to the data structures used by the application - In general not ACID - In general write-heavy
"},{"location":"designdeck/#when-to-use-a-column-oriented-store","title":"When to use a column-oriented store","text":"Because columns are stored contiguously: analytical workloads (computing average values, finding trends, etc.)
Flexible schema
Limited space (storing same data type together offers a better compression ratio)
"},{"location":"designdeck/#why-db-schemaless-is-misleading","title":"Why DB schemaless is misleading","text":"There is an implicit schema but not enforced by the DB
More accurate term: schema-on-read
Different from relational DBs with schema-on-write where the schema is explicit and the DB ensures all written data conforms to it
Similar to dynamic vs. static type checking in a programming language
"},{"location":"designdeck/#why-is-in-memory-faster","title":"Why is in-memory faster","text":"Not necessarily because they don't need to read from disk (even a disk-based storage engine may never need to read from disk if enough memory)
Can be faster because they avoid the overhead of encoding in a form that can be written to disk
"},{"location":"designdeck/#write-and-read-amplification","title":"Write and read amplification","text":"Ratio of the amount of data written/read to the disk versus the amount of data intended to be written
"},{"location":"designdeck/#write-heavy-and-replication-type","title":"Write heavy and replication type","text":"Do not rely on single-master replication as it heavily impacts the scaling of write-heavy systems
Instead, rely on leaderless replication
Trade off: consistency is harder to guarantee
"},{"location":"designdeck/#design","title":"Design","text":""},{"location":"designdeck/#auditing","title":"Auditing","text":"Checking the integrity of data
"},{"location":"designdeck/#backward-vs-forward-compatibility","title":"Backward vs. forward compatibility","text":""},{"location":"designdeck/#bloom-filter","title":"Bloom filter","text":"Probabilistic, memory-efficient data structure for approximating the content of a set
Can tell if a key does not appear in the DB
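Illustrative Java sketch (m bits and k hash functions; the seeded hash below is a weak stand-in for k independent hash functions):
import java.util.BitSet;\n\nclass BloomFilter {\n private final BitSet bits;\n private final int m, k;\n\n BloomFilter(int m, int k) { this.bits = new BitSet(m); this.m = m; this.k = k; }\n\n void add(String key) {\n for (int i = 0; i < k; i++) bits.set(index(key, i));\n }\n\n // false: definitely absent; true: maybe present (false positives possible)\n boolean mightContain(String key) {\n for (int i = 0; i < k; i++) if (!bits.get(index(key, i))) return false;\n return true;\n }\n\n private int index(String key, int seed) {\n int h = key.hashCode() * 31 + seed;\n return (h & 0x7FFFFFFF) % m; // mask keeps the hash non-negative\n }\n}\n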
"},{"location":"designdeck/#causality","title":"Causality","text":"Causal dependency: one event causing another
Happened-before relationship
"},{"location":"designdeck/#concurrent-operations","title":"Concurrent operations","text":"Not only operations that happen at the same time but also operations made without knowing about each other
Example: - Concurrent to-do list operations with a current \"Buy milk\" item - User 1 deletes it - User 2 doesn't have an internet connection, modifies it into \"Buy soy milk\", and then is connected again => this modification may have been done one hour after user 1 deletion
"},{"location":"designdeck/#consistent-hashing","title":"Consistent hashing","text":"Special kind of hashing such that when a resize occurs, only 1/n percent of the keys need to be rebalanced (n: number of nodes)
Solutions: - Ring consistent hash with virtual nodes to improve the distribution - Jump consistent hash: faster but nodes must be numbered sequentially (e.g., if we have 3 servers foo, bar, and baz => we can't decide to remove bar)
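Illustrative Java sketch of a ring with virtual nodes (Object.hashCode stands in for a stronger hash; assumes at least one node):
import java.util.Map;\nimport java.util.TreeMap;\n\nclass HashRing {\n private final TreeMap<Integer, String> ring = new TreeMap<>();\n\n void addNode(String node, int virtualNodes) {\n for (int v = 0; v < virtualNodes; v++) {\n ring.put((node + \"#\" + v).hashCode(), node); // virtual nodes improve distribution\n }\n }\n\n String nodeFor(String key) {\n // First node clockwise from the key's position; wrap around if none\n Map.Entry<Integer, String> e = ring.ceilingEntry(key.hashCode());\n return (e != null ? e : ring.firstEntry()).getValue();\n }\n}\n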
"},{"location":"designdeck/#design-impacts-of-sharing","title":"Design impacts of sharing","text":"May decrease: - Availability - Performance - Scalability
"},{"location":"designdeck/#design-read-heavy-vs-write-heavy-impacts","title":"Design: read-heavy vs. write-heavy impacts","text":"Read heavy: - Leverage replication - Leverage denormalization
Write heavy: - Leverage partition (usually) - Leverage normalization
"},{"location":"designdeck/#different-types-of-message-failure","title":"Different types of message failure","text":"Event log: - Consumers are free to select the point of the log they want to consume messages from, which is not necessarily the head - Log is immutable, messages cannot be removed by consumers (removed by a GC running periodically)
"},{"location":"designdeck/#exactly-once-delivery","title":"Exactly-once delivery","text":"Impossible to achieve
However, we can achieve exactly-once processing using deduplication or by requiring consumers to be idempotent
"},{"location":"designdeck/#flp-impossibility","title":"FLP impossibility","text":"In an asynchronous distributed system, there's no consensus algorithm that can satisfy: - Agreement - Validity - Termination - And fault tolerance
"},{"location":"designdeck/#geohashing","title":"Geohashing","text":"Encode geographic coordinates into a short string called a cell with varying resolutions
The more letters in the string, the more precise the location
Main use case: - Proximity searches in O(1)
"},{"location":"designdeck/#hashing-definition-and-size-of-md5-and-sha256","title":"Hashing definition and size of MD5 and SHA256","text":"Map data of arbitrary size to fixed-size values
Examples: - MD5: 16 bytes - SHA256: 32 bytes
"},{"location":"designdeck/#hdfs","title":"HDFS","text":"Distributed filesystem: - Fault tolerant - Scalable - Optimised for batch operations
Architecture: - Single master (maintains filesystem metadata, informs clients about which server stores a specific part of a file) - Multiple data nodes
Leverage: - Partitioning: each file is partitioned into multiple chunks => performance - Replication => availability
Read: communicates with the master node to identify the servers containing the relevant chunks
Write: chain replication
"},{"location":"designdeck/#how-to-reduce-sharing","title":"How to reduce sharing","text":"Used to approximate cardinality of a set
Optimization for space over perfect accuracy
"},{"location":"designdeck/#backing-idea","title":"Backing idea","text":"Coin flip game: you flip a coin, if head, flip again, if tail stop
If a player reaches n flips, it means that on average, he tried 2^(n+1) times
"},{"location":"designdeck/#algo","title":"Algo","text":"For an ID, we will count how many consecutive 0 (head) bits on the left
Example: 001110 => 2
Hence, on average we should have seen 2^(2+1) = 8 visitors
Requirement: visitor IDs need to be uniform => either because the ID is randomly generated or by hashing it (if the ID is auto-incremented, for example)
Required memory: log(log(m)) with m the number of unique visitors
Problem with this algo: it depends on luck. For example, if user 00000001 connects every day => the system will always approximate 2^8 visitors
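Illustrative Java sketch of this single-counter version (the multiplication is a cheap stand-in for a real hash; real implementations use bucketing, described below):
int maxLeadingZeros = 0;\n\nvoid observe(long id) {\n long h = id * 0x9E3779B97F4A7C15L; // stand-in mixing step for a real hash\n maxLeadingZeros = Math.max(maxLeadingZeros, Long.numberOfLeadingZeros(h));\n}\n\nlong estimate() {\n return 1L << (maxLeadingZeros + 1); // ~2^(n+1) unique visitors\n}\n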
"},{"location":"designdeck/#bucketing","title":"Bucketing","text":"Distribute to multiple counters and aggregate the results (possible because each counter is very small)
If we want 4 counters, we distribute the ID based on the first 2 bits
Result: 2^((n1 + n2 + n3 + n4) / 4)
Problem: mean is highly impacted with large outliers
Solution: use harmonic mean
"},{"location":"designdeck/#idempotent","title":"Idempotent","text":"If executed more than once it has the same effect as if it was executed once
"},{"location":"designdeck/#latency-numbers-every-programmer-should-know","title":"Latency numbers every programmer should know","text":"Lock with an expiry timeout after which the lock is automatically released
May lead to situations where two nodes believe they hold the lock (for example, when the expiry signal hasn't been caught yet by the first node because of a GC or CPU throttling)
Can be solved using a fencing token
"},{"location":"designdeck/#least-loaded-endpoint-load-balancing-strategy","title":"Least loaded endpoint load balancing strategy","text":"Not efficient
A more efficient option is to randomly pick two servers and route the request to the least-loaded one of the two
"},{"location":"designdeck/#liveness-property","title":"Liveness property","text":"Something good will eventually occur
Example: leader is elected, eventual consistency
"},{"location":"designdeck/#load-balancing","title":"Load balancing","text":"Route requests across a pool of servers
"},{"location":"designdeck/#load-shedding","title":"Load shedding","text":"Action to reduce the load on something
Example: when the CPU utilization reaches a threshold, the server can start returning errors
A special form of load shedding is selective client throttling, where an application assigns different quotas to each of its clients
"},{"location":"designdeck/#locality","title":"Locality","text":"Performance optimization to put several pieces of data in the same place
"},{"location":"designdeck/#log","title":"Log","text":"Append-only, totally ordered sequence of messages
Each message is: - Appended at the end of the log - Assigned a unique sequential index
Example: Kafka
"},{"location":"designdeck/#log-compaction","title":"Log compaction","text":"Throw away duplicate keys in the log and keep only the most recent update for each key
"},{"location":"designdeck/#main-drawback-of-shared-nothing-architectures","title":"Main drawback of shared-nothing architectures","text":"Reduce flexibility
If the application needs to access to new data access patterns in an efficient way, it might be hard to provide it given the system's data have been partitioned in a specific way
Example: attempting to query by a secondary attribute that is not the partitioning key might require to access all the nodes of the system
"},{"location":"designdeck/#mapreduce","title":"MapReduce","text":"Programming model for processing large amounts of data in bulk across many machines: - Map: processes a set of key/value pairs and produces as output another set of intermediate key/value pairs. - Reduce: receives all the values for each key and returns a single value, essentially merging all the values according to some logic
"},{"location":"designdeck/#microservices-pros-and-cons","title":"Microservices: pros and cons","text":"Pros: - Organizational (each team dictates its own release schedule, etc.) - Codebase is easier to digest - Strong boundaries - Independent scaling - Independent data model
Cons: - Eventual consistency - Remote calls - Harder to operate (more complex)
"},{"location":"designdeck/#number-of-values-to-generate-to-reach-50-chances-of-collision-32-bit-64-bit-and-128-bit-hash","title":"Number of values to generate to reach 50% chances of collision: 32-bit, 64-bit, and 128-bit hash","text":"Orchestration: single central system responsible for coordinating the execution
Choreography: no need for a central coordinator, each system is aware of the previous and the next
"},{"location":"designdeck/#outbox-pattern","title":"Outbox pattern","text":"Used to update a DB and publish an event in a transactional fashion
Within a transaction, persist in the DB (insert, update or delete) and insert at the same time a new row in an event table
Then, a worker checks the event table, publishes the event, and deletes the row (at-least-once guarantee)
"},{"location":"designdeck/#perfect-hashing","title":"Perfect hashing","text":"No collision, only possible if we know the keys up front
Given k elements, the hashing function returns an int between 0 and k
"},{"location":"designdeck/#quadtree","title":"Quadtree","text":"Tree data structure where each internal node has exactly four children: NE, NW, SE, SW
Main use case: - Improve geospatial caching (e.g., 1km in an urban area isn't the same as 1km outside cities)
Source: https://engblog.yext.com/post/geolocation-caching
"},{"location":"designdeck/#rate-limiting-throttling-definition-and-algos","title":"Rate-limiting (throttling): definition and algos","text":"Mechanism that rejects a request when a specific quota is exceeded
"},{"location":"designdeck/#token-bucket-algo","title":"Token bucket algo","text":"Token of a pre-defined capacity, put back in the bucket periodically:
"},{"location":"designdeck/#leaking-bucket-algo","title":"Leaking bucket algo","text":"Uses a FIFO queue When a request arrives, checks if the queue is full: - If yes: request is dropped - If not: added to the queue => Requests pulled from the queue at regular intervals
"},{"location":"designdeck/#rebalancing","title":"Rebalancing","text":"Move data or services from one node to another in order to spread the load fairly
"},{"location":"designdeck/#rest","title":"REST","text":"Architectural style where the server exposes a set of resources
All communications must be stateless and cacheable
Relies mainly on HTTP but not mandatory
"},{"location":"designdeck/#rest-vs-grpc","title":"REST vs. gRPC","text":"REST (architectural style): - Universality - Standardization (status code, ETag, If-Match, etc.)
gRPC (RPC framework): - Contract - Binary protocol (faster, less bandwidth) // We could use HTTP/2 without gRPC and leverage binary protocols but it would require more efforts - Bidirectional
"},{"location":"designdeck/#safety-property","title":"Safety property","text":"Something bad will never happen
Example: at most one leader at any given time
"},{"location":"designdeck/#saga","title":"Saga","text":"Distributed transaction composed of a set of local transactions
Each transaction has a corresponding compensation action to undo its changes
Usually, a Saga is implemented with an orchestrator that manages the execution of the transactions and handles the compensations if needed
"},{"location":"designdeck/#scalability","title":"Scalability","text":"System's ability to cope with increased load
"},{"location":"designdeck/#scalability-ceiling","title":"Scalability ceiling","text":"Hard limit (e.g., device maximum throughput)
"},{"location":"designdeck/#shared-nothing-architectures","title":"Shared-nothing architectures","text":"Reduce coordination and contention so that every request can be processed independently by a single node or group of nodes
Increase availability, performance, and scalability
"},{"location":"designdeck/#source-of-truth","title":"Source of truth","text":"Holds the authoritative version of the data
"},{"location":"designdeck/#split-brain","title":"Split-brain","text":"Network partition => nodes unable to communicate with each other => multiple nodes believing they are the leader
As a node is unaware that another node is still functioning, it can lead to data corruption or data loss
"},{"location":"designdeck/#throughput","title":"Throughput","text":"The rate of work performed
"},{"location":"designdeck/#total-vs-partial-order","title":"Total vs. partial order","text":"Total order: a binary relation that can be used to compare any 2 elements of a set with each other
Partial order: a binary relation that can be used to compare only some of the elements of a set with each other
Total ordering in distributed systems is rarely mandatory
"},{"location":"designdeck/#uuid","title":"UUID","text":"128-bit number
Collision probability: after generating 1 billion UUIDs every second for ~100 years, the probability of creating a single duplicate reaches 50%
"},{"location":"designdeck/#validation-vs-verification","title":"Validation vs. verification","text":"Validation: process of analyzing the parts of the system and building mental models that reflects the interaction of those parts
Example: validate the quality of water by inspecting all the pipes and infrastructure to capture, clean and deliver water
Verification: process of analyzing output at a system boundary
Example: verify the quality of water by testing the water (output) coming from a sink
"},{"location":"designdeck/#vector-clock","title":"Vector clock","text":"Algorithm that generates partial ordering of events and detects causality violation
"},{"location":"designdeck/#why-asynchronous-communication","title":"Why asynchronous communication","text":"Reduce temporal coupling (not connected at the same time) => processes execute at independent rates, without blocking the sender
Also suited when the interaction pattern isn't request/response with the client blocking until it receives the response
"},{"location":"designdeck/#http","title":"HTTP","text":""},{"location":"designdeck/#301-vs-302","title":"301 vs. 302","text":"301: redirect permanently
302: redirect temporarily
"},{"location":"designdeck/#403-or-404","title":"403 or 404?","text":"Retuning 403 can leak existence of a resource
Example: Apple is secretly working on super cars and creates an internal GET https://apple.com/supercar endpoint
Returning 403 means the user doesn't have the rights to access the resource, but leaks the existence of /supercar
"},{"location":"designdeck/#cookie","title":"Cookie","text":"Small files stored on a user's computer to hold specific data (e.g., language preference)
Requests made by the browser will contain cookies data
Types of cookies: - Session cookies: only lasts for the duration of a session - Persistent cookies: outlast user session - Third-party cookies: used for advertising
"},{"location":"designdeck/#four-main-http2-features","title":"Four main HTTP/2 features","text":"HTTP live streaming: video streaming protocol
"},{"location":"designdeck/#http_1","title":"HTTP","text":"Request/response protocol used to encode and transport information between a client and a server Stateless (each request is executed independently)
The request and the response are 2 standard message types exchanged in a single HTTP transaction - Request: method, URL, HTTP version, headers, body - Response: HTTP version, status, reason, headers, body
Example of a POST request:
POST https://example.com HTTP/1.0\nHost: example.com\nUser-Agent: Mozilla/4.0\nContent-Length: 5\n\nHello\n
Application layer protocol (OSI level 7)
Relies on a transport protocol (OSI level 4, TCP most of the time but not mandatory) for error detection, flow control, reliability, etc.
"},{"location":"designdeck/#http-cache-control-header","title":"HTTP cache-control header","text":"Allows setting how long to cache a response
Part of the response header (hence, cached by the browser) but can be part of the request header too (hence, cached on server side)
If the response is marked as private, the results are intended for a single user (then won't be cached by a load balancer, for example)
"},{"location":"designdeck/#http-etag","title":"HTTP Etag","text":"Entity tag header that allows clients to make conditional requests
Server returns an ETag identifying the version of a resource (e.g., based on the date and time of its last update)
Client sends an If-Match header to update a resource only if it has the most recent version
"},{"location":"designdeck/#http-keep-alive","title":"HTTP keep-alive","text":"Maintain a persistent TCP connection (reduces the number of TCP and HTTPS handshakes)
"},{"location":"designdeck/#http-methods-safeness-and-idempotence","title":"HTTP methods: safeness and idempotence","text":"Doesn't have any visible side effects and can be cached
"},{"location":"designdeck/#http-status-code-429","title":"HTTP status code 429","text":"When clients are throttled, the most common way is to return a 429 (Too Many Requests)
The response can also include a Retry-After header indicating how long to wait before making a new request (in seconds)
"},{"location":"designdeck/#http-status-codes","title":"HTTP status codes","text":"Source: https://github.com/alex/what-happens-when
"},{"location":"designdeck/#kafka","title":"Kafka","text":""},{"location":"designdeck/#consumer-types","title":"Consumer types","text":"Without consumer group: each consumer will receive all the messages in a topic
With consumer group: each consumer will receive a subset of the messages
Each consumer is assigned to multiple partitions (zero to many)
A partition is always assigned to only one consumer
If there are more consumers than partitions, some consumers will not be assigned to any partition (scalability ceiling)
"},{"location":"designdeck/#durabilityavailability-and-latencythroughput-tradeoffs","title":"Durability/availability and latency/throughput tradeoffs","text":"Source: https://developers.redhat.com/articles/2022/05/03/fine-tune-kafka-performance-kafka-optimization-theorem#kafka_priorities_and_the_cap_theorem
"},{"location":"designdeck/#log-compaction_1","title":"Log compaction","text":"Log compaction is a mechanism to give per-record retention to a topic
It ensures that Kafka will always retain at least the last message for each key of a given partition
A partition that is not yet compacted may have more than one message with the same key
Property: - retention.ms: maximum time the topic will retain old log segments before deleting or compacting them (default: 7 days)
For low-throughput topics (topics whose segments are rolled because of segment.ms rather than segment.bytes), we should ensure that segment.ms is lower than retention.ms
"},{"location":"designdeck/#offset","title":"Offset","text":"A strictly increasing identifier per partition
"},{"location":"designdeck/#partition","title":"Partition","text":"Topics are divided into partitions
A partition is an ordered, immutable log of messages
No guaranteed ordering per topic with multiple partitions
Yet, the ordering is guaranteed per partition
"},{"location":"designdeck/#partition-distribution","title":"Partition distribution","text":"The client implements a partitioner based on the key (e.g., hash(key) % number of partitions)
This is not done on Kafka's side
If key is empty: round-robin
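Illustrative Java sketch (the real default partitioner uses murmur2 and sticky batching; Arrays.hashCode and the random pick are stand-ins):
import java.util.Arrays;\nimport java.util.concurrent.ThreadLocalRandom;\n\nint partitionFor(byte[] key, int numPartitions) {\n if (key == null) {\n return ThreadLocalRandom.current().nextInt(numPartitions); // stand-in for round-robin\n }\n return (Arrays.hashCode(key) & 0x7FFFFFFF) % numPartitions; // mask keeps the hash non-negative\n}\n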
"},{"location":"designdeck/#rebalancing_1","title":"Rebalancing","text":"Not possible to decrease the number of partitions: topic has to be recreated
Possible to increase the number of partitions
Possible issue: no more guaranteed ordering as one key may be assigned to a different partition
"},{"location":"designdeck/#segment","title":"Segment","text":"Each partition is divided into segments
Instead of storing all the messages of a partition in a single file, Kafka splits them into chunks called segments. A log segment is a file identified by the first message offset it contains
Properties: - segment.bytes: maximum segment file size before creating a new segment (default: 1GB) - segment.ms: period after which a new segment is created, even if the segment is not full (default: 7 days)
Distribute messages
All the consumers from one consumer group receive a portion of the messages
One partition is assigned to one consumer, one consumer can listen to multiple partitions
"},{"location":"designdeck/#math","title":"Math","text":""},{"location":"designdeck/#associative-property","title":"Associative property","text":"A binary operation is associative if rearranging the parentheses in an expression will not change the result
Example: + is associative; e.g., (2 + 3) + 4 = 2 + (3 + 4)
"},{"location":"designdeck/#commutative-property","title":"Commutative property","text":"A binary operation is commutative if changing the order of the operands doesn't change the result - Example: + is commutative, / isn't commutative
"},{"location":"designdeck/#expected-value","title":"Expected value","text":"E[X] = p1 * x1 + ... + pn * xn (p1: probability of x1, e.g., 0.5)
"},{"location":"designdeck/#harmonic-mean","title":"Harmonic mean","text":"n / (1/x1 + ... + 1/xn) - Less sensitive to large outliers
"},{"location":"designdeck/#network","title":"Network","text":""},{"location":"designdeck/#arp-protocol","title":"ARP protocol","text":"Map an IP address to a MAC address
"},{"location":"designdeck/#average-connection-speed-in-usa","title":"Average connection speed in USA","text":"42 Mbps
"},{"location":"designdeck/#backpressure","title":"Backpressure","text":"A node limits its own rate of sending in order to avoid overloading. Queueing is done on the sender side.
Also known as flow control
Example: TCP flow control
"},{"location":"designdeck/#bandwidth","title":"Bandwidth","text":"Maximum amount of data that can be transferred in a unit of time
"},{"location":"designdeck/#bgp","title":"BGP","text":"Border Gateway Protocol: Routing system of the internet
When a client submits data via the Internet, BGP is responsible for looking at all of the available paths that data could travel and picking the best route
Note: The chosen route isn't necessarily the fastest one, it can be the cheapest one. See https://technology.riotgames.com/news/fixing-internet-real-time-applications-part-i.
"},{"location":"designdeck/#cors","title":"CORS","text":"Cross-origin resource sharing
Mechanism to allow restricted resources on a page to be requested from another domain outside the domain from which the resource was served
It extends and adds flexibility to SOP (Same-Origin Policy, same domain)
Example: User visits A and the page attempts to fetch data from B: 1. Browser sends a GET request to B with Origin header A 2. Server may respond with: - Access-Control-Allow-Origin (ACAO) header set to the domain A - ACAO set to a wildcard (*) indicating that the requests from all domains are allowed - An error if the server does not allow a cross-origin request
"},{"location":"designdeck/#difference-ping-heartbeat","title":"Difference ping & heartbeat","text":"Ping: sends messages to a process and expects a response within a specified time period (request-reply)
Heartbeat: a process is actively notifying its peers that it's still running by sending a message (notification)
"},{"location":"designdeck/#difference-tcp-udp","title":"Difference TCP & UDP","text":"A view is just an abstraction (SQL request is rewritten to match the actual schema)
A materialized view is a copy (written to disk)
"},{"location":"designdeck/#dns","title":"DNS","text":"Domain Name System: automatic translation between a name and an IP address
Notes: - Usually the local DNS configuration is the ISP one (config initialized from the router or static config) - The browser, the OS and the DNS resolver all use caches internally - A TTL is used to inform the cache how long the entry is valid
"},{"location":"designdeck/#dns-lookup-push-or-pull","title":"DNS lookup: push or pull","text":"DNS is based on the pull mode: - If record is present: DNS will return it - If record isn't present: DNS will pull the value, store it, and then return it
Notes: - New DNS records are visible immediately (nothing is cached yet) - DNS updates are slow because of TTLs (there is no push propagation; we wait for cached records to expire)
"},{"location":"designdeck/#health-checks-passive-vs-active","title":"Health checks: passive vs. active","text":"Passive: performed by the load balancer as it routes incoming requests (e.g., 503)
Active: the load balancer actively checking the health of the servers via a query to their health endpoint
"},{"location":"designdeck/#internet-model","title":"Internet model","text":"A network of networks
"},{"location":"designdeck/#layer-4-vs-layer-7-load-balancer","title":"Layer 4 vs. layer 7 load balancer","text":"Layer 4 is faster and requires less computing resources than layer 7 is but less flexible
Layer 4: look at the info at the transport layer to distribute the requests (source, destination, port)
Forward packet using NAT
Layer 7: look at the info at the application layer to distribute the requests (header, message, etc.)
Terminate the network traffic, read then open a connection to the target server
A layer 7 can de-multiplex individual HTTP requests where multiple concurrent streams are multiplexed on the same TCP connection
"},{"location":"designdeck/#mac-address","title":"MAC address","text":"A unique identifier assigned to a network interface
"},{"location":"designdeck/#max-size-of-a-tcp-packet","title":"Max size of a TCP packet","text":"64K
"},{"location":"designdeck/#mqtt-lwt","title":"MQTT LWT","text":"Last Will and Testament
Whenever a client is marked as disconnected (proper disconnection or heartbeat failure), it triggers sending a message to a particular topic
"},{"location":"designdeck/#ntp","title":"NTP","text":"Network Time Protocol: used to synchronize clocks
"},{"location":"designdeck/#osi-model","title":"OSI model","text":"7 layers: 1. Physical: transmission of raw bits over a physical link (e.g., USB, Bluetooth) 2. Data link: responsible from moving a packet of data from one node to a neighbouring node 3. Network: provides a way of sending packets between nodes that are not directly linked and might belong to other networks (e.g., IP, iptables routing) 4. Transport: application to application communication, based on ports when multiple applications on the same node wants to communicate (e.g., TCP, UDP) 5. Session 6. Presentation 7. Application: protocol of exchanges between the two sides (e.g., DNS, HTTP)
"},{"location":"designdeck/#routers","title":"Routers","text":"A way to connect networks that are connected with each other (used for the Internet)
Capable of routing packets properly across networks so that they reach their destination successfully
Based on the fact that an IP has a network prefix
"},{"location":"designdeck/#routers-buffering","title":"Routers buffering","text":"Routers use queuing (buffering) to address network congestion
A buffer has a fixed size and a fixed number of packets
If no available buffer: packet is dropped
Note: not a way to increase the throughput
"},{"location":"designdeck/#routers-processing","title":"Routers processing","text":"Per-packet processing, no buffering
Impacts: - It's faster to route 10 packets of 1000 bytes than 20 packets of 500 bytes - Sending small packets more frequently can fill the router buffer more quickly
Source: https://technology.riotgames.com/news/fixing-internet-real-time-applications-part-i
"},{"location":"designdeck/#routing-table","title":"Routing table","text":"Example:
Destination | Network mask | Gateway | Interface\n0.0.0.0 | 0.0.0.0 | 240.1.1.3 | if1\n240.1.1.0 | 255.255.255.0 | 0.0.0.0 | if1"},{"location":"designdeck/#service-mesh","title":"Service mesh","text":"All network traffic from a client goes through a process co-located on the same machine (sidecar)
Used to facilitate service-to-service communications
"},{"location":"designdeck/#switch","title":"Switch","text":"Receive frame and forward to specific links they are addressed to. Used for local networks.
Example: Ethernet frame
To do this, the switch maintains a switch table that maps MAC addresses to the corresponding interfaces that lead to them
At first, the switch table is empty. If the entry is empty, a frame is forwarded to all the interfaces (switches are self-learning)
"},{"location":"designdeck/#tcp-congestion-control","title":"TCP congestion control","text":"Determine dynamically the throughput (the number of segments that can be sent without an ack): - Increase exponentially for every segment ack - Decrease with a missed ack
Upon a new connection, the size of the window is set to a system default
It's one of the reasons why reusing a TCP connection leads to a performance increase
"},{"location":"designdeck/#tcp-connection-backlog","title":"TCP connection backlog","text":"SYN requests are queued before being accepted by a user-mode process
When there are too many requests for the process, the backlog reaches a limit and SYN packets are dropped (to be later retransmitted by the client)
"},{"location":"designdeck/#tcp-flow-control","title":"TCP flow control","text":"A receiver communicates back to the sender the size of the buffer when acknowledging a segment
Backpressure mechanism
"},{"location":"designdeck/#tcp-handshake","title":"TCP handshake","text":"3-way handshake - syn (sender to receiver) - syn-ack (receiver to sender) // ack the segment number received - ack (sender to receiver) // ack the segment number received
"},{"location":"designdeck/#websocket","title":"Websocket","text":"Communication protocol (layer 7) provides a full-duplex communication channel over a single TCP connection and bidirectional streaming capabilities
Different from HTTP but compatible with HTTP (starts as an HTTP connection, which is then upgraded via a well-defined handshake to the WebSocket protocol over the same TCP connection)
Obsolete with HTTP/2
"},{"location":"designdeck/#why-cant-we-rely-on-the-system-clock-in-distributed-systems","title":"Why can't we rely on the system clock in distributed systems?","text":"Provides guaranteed fault isolation by design
Based on the idea of partitioning a shared resource to isolate failures
"},{"location":"designdeck/#cascading-failure","title":"Cascading failure","text":"A process in a system of interconnected parts in which the failure of one or few parts can trigger the failure of other parts and so on
"},{"location":"designdeck/#causal-consistency-implementation","title":"Causal consistency implementation","text":"When a replica receives a new write, it doesn't apply it locally immediately. First, it checks whether the write's dependencies have been committed locally. If not, it waits until the required version appears.
"},{"location":"designdeck/#circuit-breaker","title":"Circuit breaker","text":"Used to prevent a network or service failure from cascading to other failures
Implemented on the client-side
Three states: - Closed: accept requests - Open: do not accept requests and fail immediately - Half-open: give the service another chance (can also be implemented using a probe)
The circuit can be opened when the health endpoint of the service is down or when the number of consecutive errors reaches a threshold
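Illustrative Java sketch of the three states (threshold and cool-down values assumed):
enum State { CLOSED, OPEN, HALF_OPEN }\n\nclass CircuitBreaker {\n private State state = State.CLOSED;\n private int failures = 0;\n private long openedAt = 0;\n private final int threshold = 5;\n private final long coolDownMillis = 30_000;\n\n synchronized boolean allowRequest() {\n if (state == State.OPEN && System.currentTimeMillis() - openedAt > coolDownMillis) {\n state = State.HALF_OPEN; // give the service another chance\n }\n return state != State.OPEN; // OPEN: fail immediately\n }\n\n synchronized void onSuccess() { failures = 0; state = State.CLOSED; }\n\n synchronized void onFailure() {\n failures++;\n if (state == State.HALF_OPEN || failures >= threshold) {\n state = State.OPEN;\n openedAt = System.currentTimeMillis();\n }\n }\n}\n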
"},{"location":"designdeck/#exponential-backoff","title":"Exponential backoff","text":"Wait time increased exponentially after every retry attempt
"},{"location":"designdeck/#fault-tolerance","title":"Fault tolerance","text":"Property of a system that can continue operating correctly in the presence of failure of its components
"},{"location":"designdeck/#jitter","title":"Jitter","text":"Introduces a part of randomness to avoid synchronized retry spikes experienced during cascading failures
"},{"location":"designdeck/#knee-point","title":"Knee point","text":"Moment when linear scalability is not possible anymore
"},{"location":"designdeck/#phi-accrual-failure-detector","title":"Phi-accrual failure detector","text":"Instead of treating failure node failure as a binary problem (up or down), a phi-accrual failure detector has a continuous scale, capturing the probability of the monitored process's crash
Works by maintaining a sliding window, collecting arrival times of the most recent heartbeats
Used to approximate the arrival time of the next heartbeat and compute a suspicion level (how certain the failure detector is about a failure)
"},{"location":"designdeck/#retry-amplification","title":"Retry amplification","text":"Having retries at multiple levels of the dependency chain can amplify the number of retry
The deeper a service in the chain, the higher the load it will be exposed to due to amplification:
In case of a long dependency chain, perhaps we should only retry at a single level of the chain
"},{"location":"designdeck/#security","title":"Security","text":""},{"location":"designdeck/#authentication","title":"Authentication","text":"Process of determining whether someone or something is who or what it declares itself to be
"},{"location":"designdeck/#certificate-authorities","title":"Certificate authorities","text":"Organizations issuing certificates by signing them
"},{"location":"designdeck/#cipher","title":"Cipher","text":"Encryption algorithm
"},{"location":"designdeck/#confidentiality","title":"Confidentiality","text":"Process of protecting information from being accessed by unauthorized parties
Mainly achieved via encryption
"},{"location":"designdeck/#integrity","title":"Integrity","text":"The process of preserving the accuracy and completeness of data over its entire lifecycle, so that they cannot be modified in an unauthorized or undetected manner
"},{"location":"designdeck/#mutual-tls","title":"Mutual TLS","text":"Add client authentication using a certificate
"},{"location":"designdeck/#oauth-2","title":"OAuth 2","text":"Standard for access delegation
Process: - Client gets a token from an authorization server - Makes a request to a server using the token - Server validates the token against the authorization server
Notes: some token types like JWT are self-contained, meaning the validation can be done by the server without a call to the authorization server
"},{"location":"designdeck/#public-key-infrastructure-pki","title":"Public key infrastructure (PKI)","text":"System for managing, storing, and distributing certificates
Relies on certificate revocation lists (CRLs)
"},{"location":"designdeck/#tls-handshake","title":"TLS handshake","text":"With mutual TLS:
One way: the session key is generated by the client
"},{"location":"designdeck/#two-main-uses-of-encryption","title":"Two main uses of encryption","text":"Encryption in transit
Encryption at rest
"},{"location":"designdeck/#two-types-of-encryption","title":"Two types of encryption","text":"Symmetric: key is shared between a client and a server (faster)
Asymmetric: two keys are used, a private and a public one - Client encrypts a message with the public key - Server decrypts the message with its private key
"},{"location":"designdeck/#what-does-digital-signature-provide","title":"What does digital signature provide","text":"Integrity and authentication
"},{"location":"designdeck/#what-does-tls-provide","title":"What does TLS provide?","text":"