In a distributed system, there is no shared memory but only message passing – it’s a fundamental difference from a shared memory non- distributed system.
Transparency
access transparency: local and remote resources are accessed using the same operations.
location transparency: remote resources are accessed using location independent names.
concurrency transparency: processes can access resources without interfering with each other.
failure transparency: failures are concealed for the users of a resource.
replication transparency: users of a resource access it as if it was not replicated.
Network
quality of service
reliability
security
performance
Internet, network
A network: a hardware and software data communication system that provides interconnection of computers and other devices.
The Internet*: Internet is a set of networks connected with routers. The largest internet that includes commercial, military, university, and other networks with different physical links and various protocols, including IP (Internet Protocol)
client-server, peer-to-peer
client-server: client is the active part.
peer-to-peer: all nodes are active and can be the initiator of operations.
synchronous, asynchronous
synchronous: upper bound of time to perform an operation is known.
asynchronous: time to perform an operation does not have an upper bound.
difference: possibility of perfect failure detectors.
UDP, TCP
UDP
small message that should be sent with little delay.
best effort delivery of messages to a process.
TCP
large messages or a sequence of messages.
full-duplex stream between two processes.
port numbers can be found in the TCP header.
window size a limiting factor for high capacity communication: over long fat communication links.
active/passive replicated server
active replicated server
Multiple servers actively process client requests simultaneously.
Each server maintains an identical copy of the state and is capable of processing client requests independently.
The servers continually communicate to ensure that their states remain consistent. Replicas are in a consistent state because they reliably receive all requests in a total order.
passive replicated server
Only one server (the primary or master) actively processes client requests, while the others (the backups or replicas) remain in a standby mode.
The primary server processes client requests and updates its state accordingly.
The state updates are then replicated to the backup servers to ensure that they have an up-to-date copy of the data. Use view-synchronous group communication so that all or none of the backup servers have received an update before a new primary is elected.
If the primary server fails, one of the backup servers is elected as the new primary, and it takes over processing client requests.
RPC(remote procedure call)
invocation semantic: at least once
An idempotent operation is an operation that can be performed repeatedly with the same effect as if it had been performed exactly once.
Java RMI(remote method invocation)
argument pass: remote objects as reference, all other as copies.
reason: the object contains a mutable state that should not be duplicated.
invocation semantic: at most once
level of transparency: location transparency
At-most-once or At-least-once
At-most-once
The request has been executed once.
For non-idempotent operations.
Implemented using a history or simply not resending requests.
If an operation fails, the request might have been executed at most once.
At-least-once
The request has been executed at least once.
For idempotent In re-sending.
No need for a history; simply resend requests until a reply is received.
If an operation fails, the request might have been executed at least once.
Erlang
process communicate: asynchronous message passing
provide location transparency: use a process identifier without having to know the address of the node where the process lives.
destination in send operation: process identifier or a local or remote registered name.
circular construction: processes can refer to each other in a circular way.
File system
Descriptors, table entries and i-nodes
i-node: Inode holds the metadata for this file and, traditionally, the pointers to the disk blocks where the file resides on the hard drive.
soft link, hard link
soft link: a path that is resolved to another path.
hard link: a mapping of a name to a file identifier.
NFS(Network File System)
client-side cache entry validation: if the validity was checked less than t seconds ago or if the server modification time is equal to the client modification time.
authentication control: NFS is a stateless server, so user’s identity and access right must be checked by the server on each request.
what position to read and write an opened file: each read and write operation holds the position.
AFS
client side caching implementation: server promise to notify the client if a file is modified by another client. Client receive a call-back promise. The file is valid if the promise exists and is not too old.
inconsistent view of a file: when callback message is lost, an old version of a file may be opened after it has been updated by another client.
URL, URN, URI
URL and URN are subsets of URI.
URL: location and access method of resource.
URN: assign a unique name for resource, regardless of location.
DNS
inconsistencies of the resolver cache: each entry has a time-to-live.
inconsistent cached entries removed: all entries have an expiration time set when cached.
Time and Clock
Christian’s algorithm
Berkeley algorithm
perform internal synchronization.
Leader Selection: One of the nodes in the network is designated as the leader or coordinator. This leader is responsible for coordinating the time synchronization process.
Time Measurement: The leader periodically sends a time request message to all other nodes in the network. Each node records its current local time and sends it back as a time response message to the leader.
Time Calculation: The leader collects the time response messages from all nodes. It then calculates the average time offset (difference) between its own clock and the clocks of all other nodes.
Time Adjustment: The leader sends time adjustment messages to all nodes, instructing them to adjust their local clocks based on the calculated average offset. This adjustment aims to bring the clocks into approximate alignment.
Clock Adjustment: Each node, upon receiving the time adjustment message, adjusts its local clock accordingly.
NTP(Network Time Protocol)
Client-Server Model: computers (clients) periodically request time information from NTP servers. The servers respond with highly accurate time data.
Clock Synchronization: The NTP client compares the time received from the server with its own local clock. It then applies a correction to adjust its clock to match the server’s clock more closely.
Primary servers are connected directly to a time source; secondary servers are synchronized with primary servers. The servers are connected in a logical hierarchy called a synchronization sub-net, whose levels are called strata. Primary servers occupy stratum 1: they are at the root. Stratum 2 servers are secondary servers that are synchronized directly with the primary servers; stratum 3 servers are synchronized with stratum 2 servers, and so on.
Lamport, Vector
only the vector clock gives a complete description of the happened before order.
happened-before
A happened before B ⇒ A must have occurred in real-time before B.
A occurred in real-time before B ⇒ A could have happened before B.
A caused B ⇒ A happened before B.
Lamport
if L(a) < L(b) then a could have caused b
if L(a) ≮ L(b) then a did not happened before b
when implementing mutual-exclusion: prevent to deadlock
Vector
problem: a dynamic set of processes.
Global State
A global state corresponds to the initial prefixes of the individual process histories.
A *cut **is a subset in the global history up to a specific event in each history.
consistent cut: if e is in the cut and f happened-before e then f is in the cut. A consistent cut ***corresponds to a **consistent global state.
A run ***is a total ordering of all events in a global history **consistent with each local history.
A linearization **or **consistent run **is a run that describes transitions between **consistent global states.
stable global state: if a system enters a state where the predicate holds true it will remain true in all future states.
example: deadlock.
unstable global state: the predicate could hold true in a state but then be false in future states.
global state not local node
Deadlock Detection: If we do not compare the global state with the local state of the nodes, we could determine a deadlock which isn’t really one (“phantom deadlock”).
Garbage Collection: In distributed garbage collection it could happen, that we remove an object whose reference is still on the way to the receiver.
Lost Token Detection: The liveness of ring-based networks mainly depends on the availability of the token. The token is passed further by each node in the ring. If we capture a global state while the token is on the way from one node to the other, we could mistakenly determine the loss of the token.
Chandy and Lamport algorithm
there is a linearization from the original state to the final state that passes through the snapshot.
itiation of Snapshot: The snapshot algorithm begins when a process, called the Initiator, decides to take a snapshot. The Initiator does not inform other processes about this decision.
Marker Messages: The Initiator initiates the process by sending a special marker message along all outgoing channels. When a process receives a marker message for the first time, it records the state of its local variables (this is the “local snapshot”).
Snapshot Collection: Once has received marker messages on all incoming channels, it has captured a consistent global snapshot.
Coordination
Broadcast: No one keeps track of who listens in a broadcast service.
Multicast: In a multicast service, the sender sends a message to a specific group; the system keeps track of who should receive it.
Bully algorithm
better turnaround time.
Leader Detection Trigger: The algorithm starts when a process detects that the current leader has failed or when a new process joins the network and needs to elect a leader.
Initiating an Election: When a process detects the absence of a leader, it initiates an election process by sending an election message to all processes with higher IDs.
Receiving OK Messages: If the initiating process receives an “OK” message from all processes with higher IDs, it becomes the new leader and sends a victory message to inform all processes of its leadership.
ring-based election
nodes can easily change priority in-between elections.
Initiation: The election process is initiated when a triggering event occurs, such as the failure of the current leader node or a request to elect a new leader.
Message Passing: The initiation node (initiator) sends an “election” message containing its own identifier to its successor node. The message includes the initiator’s identifier and a flag indicating that an election is in progress.
Processing at Nodes:
When a node receives an “election” message, it compares the received identifier with its own.
If the received identifier is greater, the node updates the message with its own identifier and continues to pass it along to the next node.
If the received identifier is smaller, it means that the message has completed the full circle and reached the initiator, signifying the end of the election.
Election Completion: Once the initiator receives the message back with its own identifier, it knows that it has the highest identifier and declares itself the new leader.
reliable multicast
messages are guaranteed to be delivered to all correct processes.
Implement a reliable multicast using only basic multicast: by resending each received message to all other nodes.
FIFO
Messages are delivered to processes in the same order in which they were sent.
This means that if process A sends message X and then message Y to process B, process B will receive and process X before Y.
casual order
It ensures that messages are delivered in an order that respects the causal relationships between events in the system.
If event A causally precedes event B, then any message related to A must be delivered before any message related to B.
Causal ordering does not strictly imply FIFO.
total order
In total order, every process receives messages in the same order. This ensures that all processes agree on the exact sequence of message deliveries.
Uniform agreement
Uniform agreement: if any node, correct or incorrect, delivers a message, then all correct nodes will deliver the message.
Non-uniform agreement: if a correct node delivers a message, then all correct nodes will deliver the message.
Transaction
problem
dirty read: reading a value that has not been committed.
premature writes: writing a value that has not been committed.
lost update: both transactions read the same object and attempt to update the same data concurrently, and one of the updates is lost, resulting in an incorrect final state.
inconsistent retrieval: both transactions write to the same object and different transactions reads different value for the same object.
serially equivalent
The combined effect is the same as if the transactions had been performed one at a time in some order is a serially equivalent interleaving.
The use of serial equivalence as a criterion for correct concurrent execution prevents the occurrence of lost updates and inconsistent retrievals.
Two transactions are serially equivalent if, and only if, all pairs of conflicting operations of the two transactions are executed in the same order on all the objects they both access.
Execution Phase: The transaction performs its operations on the data it has read. This includes any updates, deletions, or insertions.
Validation Phase: Once the transaction has completed its operations, it must validate that no other transaction has modified the data it read before it commits.
Conflict Resolution: If a conflict is detected during validation, the system must resolve it. This can be done by: Abort the current transaction and restart it, allowing it to re-read the data and try the operations again.
Commit Phase: If no conflicts are detected during validation, the transaction is ready to commit. It applies its changes to the database.
two-phase locking
Growing Phase: In this phase, a transaction can acquire locks but cannot release any. Once a transaction releases a lock, it cannot acquire any more locks.
Shrinking Phase: After a transaction releases its first lock, it enters the Shrinking Phase. In this phase, a transaction can release locks but cannot acquire any more.
two-phase commit
phase one (voting): ask participants to vote for commit or abort
if voting for commit, one has to be able to commit even after a node crash
if anyone aborts, all must abort
phase two (completion): inform all participants of the result
forward/backward validation
Forward validation: checks the transaction undergoing validation with other later(future) transactions, which are still active.
Backward validation: checks the transaction undergoing validation with other preceding(past) overlapping transactions – those that entered the validation phase before it.
Timestamp ordering
Each operation in a transaction is validated when it is carried out
The timestamp defines its position in the time sequence of transactions.
A transaction’s request to write an object is valid only if that object was last
read and written by earlier transactions.
A transaction’s request to read an object is valid only if that object was last written by an earlier transaction.
acid
atomicity: either all or no operations in the transaction are performed
isolation: intermediate results must not be visible to other transactions
durability: the effects of the transaction will remain even if we have a server crash
Group Membership
view: the set of processes belonging to a group.
a process that enters the group will deliver: all messages starting from the view when it enters.
different from reliable multicast: a message is sent and delivered only by processes belonging to the same view.
DHT(Distributed Hash Table)
limit the number of hops to reach any key: at most log(N)
leaf set: shortcut the look-up when is close to the destination key.
leaf set includes nodes less than self and nodes greater than self.
Mutual exclusion
Ricart and Agrawalas altorithm
Requesting Access: When a process wants to access a critical section (a shared resource), it sends a “request” message to all other processes in the system.The “request” message includes the process’s unique identifier and a timestamp representing the order in which requests were made.
Receiving Requests:
When a process receives a “request” message from another process, it compares the received request’s timestamp with its own.
If the received request has a higher priority (earlier timestamp), the receiving process defers.
If the received request has a lower priority (later timestamp), the receiving process continues processing and does not grant access to the critical section.
Central Solution
In a central solution, there is a single central entity that has the authority to make decisions and coordinate activities.