01. Introduction
- Fundamental design principles of data networks
- Networked applications, socket programming
- Selected data transmission, LAN
- Internet architecture and algorithms
- Routing protocols and algorithms
- Transport layer: UDP, TCP
Term 3 quizzes and assignments:
- 25%
- 4 lab quizzes: 2% each
- Week 1: problem sheet + optional quiz
- Week 5: problem sheet
- Super quiz: packet processing, 7% during week 3
- Assignment: socket programming, 10%
- Sunday August 16 11:59 pm
Mid-term test:
- 25% (electronic)
- Friday September 11 7:00-8:30 pm
- Lectures and lab from term 3
Term 4 quizzes:
- 4 lab quizzes, 2% each
Lab test:
- 17%
- Friday October 9 7:00-8:00 pm
- Term 4 lab content
Final exam:
- 25%
- Term 4 content only
- 90 minutes
02. Sockets
Ethernet/WLAN/PPP etc. -> Internet Protocol -> TCP/DNS/HTTP etc.
Everything over IP, IP over everything.
IP Address
- Each host (not just end hosts) is identified by one or more IP addresses
- Usually as many IP addresses as network adapters
- End hosts usually have one
- Exceptions e.g. laptops with Wi-Fi and ethernet
- The IP address both identifies the host and helps with routing
- DNS used to translate human readable host names to IPv4 addresses
Representation
- 32-bit wide address
- Worldwide unique (with caveats)
- Dotted-decimal notation: 192.168.1.1
IP Service - Best Effort
The basic IP service is packet delivery. It is:
- Stateless: no connection/shared state
- Unacknowledged: the IP service does not send an acknowledgement that a packet has been received
- Unreliable: no retransmissions on the IP level
- Unordered: does not guarantee in-sequence delivery
UDP and TCP
- UDP: User Datagram Protocol
- TCP: Transmission Control Protocol
- Both operate on top of IPv4: they generate segments that are carried as the IPv4 packet payload (protocol layering)
- Both are unicast: between a pair of IP addresses
- They add their own addressing capabilities: port numbers
- The IPv4 address refers to the end host; the port number refers to a particular application running on that host
UDP
IPv4, plus port numbers. It is:
- Connectionless
- Unacknowledged
- Unreliable
- Unordered
TCP
TCP allows safe and reliable transfer over the internet. It is:
- Connection-oriented
- Different from a connection in a circuit-switched network: resources are NOT reserved
- Goes through the same three phases of a connection: setup, use and teardown
- Reliable and in-sequence
- Byte-stream oriented
- Full-duplex
Ports
- 16-bit port number
- A process can allocate one or more ports
- One port is allocated to at most one process
- Some ports, well-known ports, are allocated to well-known applications
- 22: SSH; 25: SMTP; 53: DNS; 80: HTTP; 443: HTTPS
- This is convention, not a rule
- Ports 1023 and below are usually reserved for system services
- Admin permissions required in Linux
Processes and blocking socket calls
The OS creates an abstraction that makes each process think it has its own exclusive processor and exclusive memory (memory not accessible by other processes). Below the abstraction, time-sharing is used. A process can be:
- Running: running on the processor
- A process will be given some quantum of time to run, after which it will be swapped out with another runnable process
- Runnable: ready to run on the processor, but not currently running
- Blocked: waiting for external input e.g. file from disk, socket
- Blocked processes use no CPU resources
- Much better than busy-looping (polling)
- NB: if one thread makes a blocking call, the whole process is blocked
Socket API
- Follows the Unix philosophy that (almost) everything is a file (or acts like a file)
- One socket is bound to exactly one port
- Within the same process, multiple sockets can be bound to the same port
- Associated with its underlying protocol (e.g. UDP or TCP)
- Application does not need to know the details of the abstraction, but needs to specify which one to use
- Associated with buffers - decouples the application from the underlying data protocol
- Receiver's TCP/UDP entity places incoming data into the receive buffer; the process will read() from the buffer at its own discretion
- Transmitter write()s data into the transmit buffer; TCP/UDP sends it at its own discretion
- When writing to a socket:
- UDP: data is encapsulated in a UDP datagram and transferred - one write, one datagram
- TCP: data is buffered and may be combined with data from previous/future write()s before being transmitted
- A successful write means the data was successfully put into the buffer
Client/Server Paradigm
Client applications must:
- Know the server IP address and port number
- Have its own IP address and port number
- Have a socket bound to these
- Through the socket, it can initiate contact with the server
Server applications must:
- Have an IP address and port number
- Must be known to clients
- Must have a socket bound to these
- Accept service requests from clients and respond to them
The client and server can run on the same machine: 127.0.0.1 (localhost) refers to the local device itself.
Socket Types
Differ by OS, but these are universal:
- Stream socket: TCP (usually)
- Reliable, in-sequence transmission of a byte stream
- The application does not necessarily know what protocol is being used under the hood - usually TCP, but not guaranteed
- If delivery fails (for a prolonged time), the sending application is notified
- i.e. receiver not responding
- A write() does not guarantee a packet - the underlying protocol decides on its own whether to send a packet immediately or combine the data with other write()s
- The receiver cannot detect the boundary between write()s - record boundaries are not preserved
- Datagram socket: UDP (usually)
- Does not guarantee reliable or in-sequence delivery
- If delivery fails, the sending application is not notified
- One write() of x bytes produces one UDP packet
- A read() will return a block of at most x bytes (when nothing is lost) - record boundaries are preserved
Example Workflow: TCP client
socket()
- Creates a socket, allocates resources (buffer, socket handle etc.)
- Initially assigned to a random, unused port number
- Can fail (e.g. lack of buffer space, not enough handles, need root)
connect()
- Specify IP address and port number of the server
- (Tries to) establish a connection with the server
read()/write()
- Or sendto()/recvfrom() - additional parameters
- Reads data from/sends data to the server
close()
- Closes the connection; frees up resources
Example workflow: TCP Server
socket()
- Creates the socket
bind()
- Binds to a particular port number and IP address
listen()
- Declares the willingness to accept incoming connections and allocates resources (a queue) for incoming connection requests
- The queue (backlog) size is specified
accept()
- Accepts an incoming connection request
- Takes the oldest connection request from the queue
- Generates a new socket with the same port number (as the socket listen() is using)
- Allows a separate thread to process it for parallel processing
- On the new socket: read(), write(), close()
close() on the initial socket - to stop accepting new connections
Example workflow: UDP client
socket()
recvfrom(), sendto()
- sendto() requires you to explicitly specify the receiver address
- recvfrom() gives you the IP address of the sender as well as the data
close()
connect(), then read()/write(), can also be used: connect() gives you a default receiver (and expected sender) address, and can be called multiple times.
Example workflow: UDP server
socket()
bind()
recvfrom(), sendto()
close()
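Since only TCP examples appear later, here is a minimal UDP client sketch following the workflow above (hypothetical server at 127.0.0.1 port 9000; most error handling omitted). One sendto() produces one datagram; one recvfrom() returns one datagram plus the sender's address.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void) {
    int sockfd = socket(AF_INET, SOCK_DGRAM, 0);     // UDP socket
    if (sockfd < 0) { perror("socket"); return 1; }

    struct sockaddr_in srv;
    memset(&srv, 0, sizeof(srv));
    srv.sin_family = AF_INET;
    srv.sin_port   = htons(9000);                    // hypothetical server port
    inet_aton("127.0.0.1", &srv.sin_addr);           // hypothetical server address

    // One sendto() -> one UDP datagram (record boundaries preserved)
    const char *msg = "ping";
    sendto(sockfd, msg, strlen(msg), 0, (struct sockaddr *)&srv, sizeof(srv));

    // Block until one datagram arrives; recvfrom() also reveals the sender
    char buf[256];
    struct sockaddr_in peer;
    socklen_t peerlen = sizeof(peer);
    ssize_t n = recvfrom(sockfd, buf, sizeof(buf) - 1, 0, (struct sockaddr *)&peer, &peerlen);
    if (n >= 0) {
        buf[n] = '\0';
        printf("got \"%s\" from %s\n", buf, inet_ntoa(peer.sin_addr));
    }

    close(sockfd);
    return 0;
}
```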
Socket API Functions
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>
Each call returns an integer; a negative value indicates an error, and the errno variable is set.
Socket (non-blocking)
- Creates a new socket structure
- Allocates resources e.g. socket buffer
- Assigns to random un-used port number
- Returns file descriptor representing the socket
int socket(int domain, int type, int protocol);
- domain: address family/networking technology, e.g. AF_INET (IPv4), AF_INET6 (IPv6)
- type: type of socket, e.g. SOCK_STREAM (TCP-like), SOCK_DGRAM (UDP-like), SOCK_RAW
- protocol: protocol for the given socket type. Usually only one sensible option; if in doubt, set to 0
If successful, returns a file descriptor (a non-negative integer). If it fails, returns -1 and sets errno.
Bind (non-blocking)
- Links a socket to a particular IP address/port number/address family combination
int bind(int sockfd, const struct sockaddr *addr, socklen_t addrlen);
- sockfd: socket descriptor (returned from socket())
- addr: pointer to a struct containing address information. If IPv4, a sockaddr_in needs to be cast to sockaddr
- addrlen: length of the address structure
struct sockaddr tries to be general, so it is big enough to fit any type of address. sa_family holds the same address family value as the domain passed to socket().
struct sockaddr {
sa_family_t sa_family;
char sa_data[14];
}
When passing a sockaddr_in to the socket calls, the pointer is cast to struct sockaddr *.
struct sockaddr_in {
short sin_family;
unsigned short sin_port;
struct in_addr sin_addr;
char sin_zero[8];
};
struct in_addr {
unsigned long s_addr;
};
Listen (non-blocking)
int listen(int sockfd, int backlog);
Declares that you are willing to accept incoming streams (SOCK_STREAM or SOCK_SEQPACKET connections), and allocates resources (a queue).
backlog: how long the queue can be
Accept (blocking)
- Waits for incoming connection requests
- Returns a new socket (with the same port number) over which you can talk to the connecting client; the return value is the new socket's descriptor
int accept(int sockfd, struct sockaddr *addr, socklen_t *addrlen);
addr: pointer to address struct. It will fill in the address details of the sender
Connect (blocking)
int connect(int sockfd, const struct sockaddr *addr, socklen_t addrlen);
- For stream sockets: requests establishment of a connection with the server
- For datagram sockets: used to specify the default receiver for datagrams when send() or write() is used
Blocks until the fate of the connection setup is known:
- For stream sockets, the connection must be accepted using accept() and the acknowledgement must be received
- For datagram sockets, there is no connection setup to confirm, so the call does not block
Read
ssize_t recv(int sockfd, void *buf, size_t len, int flags);
ssize_t recvfrom(int sockfd, void *buf, size_t len, int flags, struct sockaddr *src_addr, socklen_t *addrlen);
ssize_t read(int fd, void *buf, size_t count);
read: prepare buffer area, size of the buffer area, and pass it to read, which will fill it up with data.
recvfrom: the caller also provides memory for address structure and the function fills it in with the address/port/address family of the host.
Read will be blocking if there is no data (unless a flag is set using recv/recvfrom, in which case it will return an error code).
Write
ssize_t send(int sockfd, const void *buf, size_t len, int flags);
ssize_t sendto(int sockfd, const void *buf, size_t len, int flags, const struct sockaddr *dest_addr, socklen_t addrlen);
ssize_t write(int fd, const void *buf, size_t count);
write: pass it a buffer of data to send, and the number of bytes to send.
If the underlying socket buffer is too full, it will be blocking (unless the flag is set, which will cause it to return an error code).
send/write require the addressee to be specified using connect.
Close
int close(int fd);
Frees up socket resources.
Endianness
For multi-byte data types (e.g. the 16-bit unsigned port number), in packet headers in particular, there must be a canonical representation for data being transmitted - the network byte order. Hosts must convert between their own internal representation (host byte order) and the network representation.
The canonical representation for the internet is big-endian (most significant byte at lowest memory address). Intel hates it.
Host (h) to network (n) conversion, or vice versa, available as helper functions for 16/32 bit integers.
#include <arpa/inet.h>
uint32_t htonl(uint32_t hostlong); // host to network long (32 bits)
uint16_t htons(uint16_t hostshort); // host to network short (16 bits)
uint32_t ntohl(uint32_t netlong); // network to host long
uint16_t ntohs(uint16_t netshort); // network to host short
Helper functions for address manipulation.
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
int inet_aton(const char *cp, struct in_addr *inp);
in_addr_t inet_addr(const char *cp);
in_addr_t inet_network(const char *cp);
char *inet_ntoa(struct in_addr in);
struct in_addr inet_makeaddr(int net, int host);
in_addr_t inet_lnaof(struct in_addr in);
in_addr_t inet_netof(struct in_addr in);
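As a small usage sketch (not from the lecture notes), inet_aton() and inet_ntoa() convert between a dotted-decimal string and the 32-bit, network-byte-order value stored in struct in_addr:

```c
#include <stdio.h>
#include <arpa/inet.h>

int main(void) {
    struct in_addr a;
    if (inet_aton("192.168.1.1", &a) == 0) {   // string -> 32-bit address (network byte order)
        fprintf(stderr, "invalid address\n");
        return 1;
    }
    printf("raw (host order): 0x%08x\n", ntohl(a.s_addr));  // 0xc0a80101
    printf("back to string:   %s\n", inet_ntoa(a));         // "192.168.1.1"
    return 0;
}
```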
TCP Client Example
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netdb.h>
int main() {
char *server_name = "localhost";
uint16_t portno = 80;
struct sockaddr_in serv_addr; // Address structure for server
struct hostent *server; // Result of DNS resolver
char buffer[256] = "A message to send"; // Buffer for data to send
// Create socket with protocol 0 (default) for given address family - TCP
int sockfd = socket(AF_INET, SOCK_STREAM, 0);
if (sockfd < 0) {
perror("ERROR opening socket");
exit(1);
}
server = gethostbyname(server_name); // DNS resolution
if (server == NULL) {
fprintf(stderr, "ERROR, no such host\n");
close(sockfd); // Not strictly needed before exit, but good style to close the socket
exit(1);
}
bzero((char*) &serv_addr, sizeof(serv_addr)); // Zero out the struct
serv_addr.sin_family = AF_INET;
// Copy IP address from DNS response
bcopy((char*)server->h_addr,
(char*)&serv_addr.sin_addr.s_addr,
server->h_length
);
// Set port number
serv_addr.sin_port = htons(portno);
if (connect(sockfd, (struct sockaddr*) &serv_addr, sizeof(serv_addr)) < 0) {
perror("ERROR connecting");
close(sockfd);
exit(1);
}
int n = write(sockfd, buffer, strlen(buffer));
if (n < 0) {
perror("ERROR writing to socket");
exit(1);
}
bzero(buffer, 256); // Zero out buffer; wait for response
n = read(sockfd, buffer, 255);
if (n < 0) {
perror("ERROR reading from socket");
exit(1);
}
printf("%s\n", buffer);
}
TCP Server Example
int main() { // same includes as the client example
int portno = 80;
int sockfd = socket(AF_INET, SOCK_STREAM, 0);
struct sockaddr_in serv_addr;
struct sockaddr_in cli_addr;
char buffer[256]; // Buffer for received data
if (sockfd < 0) {
perror("ERROR opening socket");
exit(1);
}
bzero((char *) &serv_addr, sizeof(serv_addr));
serv_addr.sin_family = AF_INET;
serv_addr.sin_addr.s_addr = INADDR_ANY; // Bound to all local interfaces. Otherwise, use the IP address of a specific interface
serv_addr.sin_port = htons(portno);
if (bind(sockfd, (struct sockaddr *) &serv_addr, sizeof(serv_addr)) < 0) {
eprint("ERROR on binding");
exit(1);
}
listen(sockfd, 5); // Allow a maximum of 5 queued connections
socklen_t clilen = sizeof(cli_addr);
// Wait (blocking) for incoming connection request with `accept`
int newsockfd = accept(sockfd, (struct sockaddr *) &cli_addr, &clilen);
if (newsockfd < 0) {
eprint("ERROR on accept");
exit(1);
}
bzero(buffer, 256);
int n = read(newsockfd, buffer, 255);
if (n < 0) {
perror("ERROR reading from socket");
exit(1);
}
printf("%s\n", buffer);
n = write(newsockfd, "Message received", 17);
if (n < 0) {
perror("ERROR writing to socket");
exit(1);
}
close(newsockfd);
close(sockfd);
return 0;
}
03. Protocol Layering
Layering
The internet and POTS are complex and require vast amounts of software. Hence, structuring principles are used to achieve:
- Modularity, allowing software re-use
- Independence of network technologies (transparency)
- Separation of concerns
- Correctness
Networking software is layered: layer N provides services to layer N+1 through SAPs.
- The SAP is a service access point - an interface through which layer N provides a service to layer N+1
- Layer N can provide multiple SAPs
- e.g. UDP: the SAP is the socket API (and the associated port number)
- Layer N's protocol entity is the implementation of layer N. The protocol entity cannot communicate with the other host's protocol entity (at the same level) directly; instead, it must use the service provided by layer N-1 to communicate with the other host indirectly
- However, the protocol entity can act as if this is the case
- Layer N must never make any assumptions about what the other layers are doing internally - only the service interface matters
General layout of a Layer N-PDU (Protocol Data Unit)
| N-protocol header | N + 1 data (N-SDU) | N trailer |
- The N-PDU/packet is constructed by the N-protocol entity
- It carries the data given by layer N+1 for transmission; this is called the service data unit (SDU)/payload/user data
- The N-protocol entity adds a header containing control information relevant to the receiving layer-N entity
- It may also add a trailer (usually a checksum)
Thus, going down the stack, each layer sandwiches the data from the layer above between its own header and trailer:
| N-2 hdr | N-1 hdr | N hdr | N+1 data | N trl | N-1 trl | N-2 trl|
The receiver does the opposite, with each layer removing its header/trailer before handing the payload to the upper layer.
OSI Seven Layer Model
OSI: Open Systems Interconnection. A commercial failure, but the terminology and layering architecture/concepts are foundational to networking.
- Physical layer
- Link layer
- Network layer
- Transport Layer
- Session layer
- Presentation layer
- Application layer
If there are two end hosts, A and B, and a router in between:
- The lowest two layers, physical and link, are single-hop scope, exchanging PDUs only between two physically connected hosts
- The network layer uses hop-by-hop communication to achieve end-to-end communication
- Routers work only on the lowest three layers
- Routers are stupid; intelligence in the end hosts
- The upper four layers - transport, session, presentation and application - exchange PDUs between end hosts; they are purely concerned with end-to-end communication
Physical (PHY)
Transmission of digital data over a physical medium using modulated waveforms/signals.
Often requires the specification of:
- Cable types/frequencies
- Connectors
- Electrical specifications
- Modulation/demodulation and signal specification
- Carrier or bit synchronization methods
Link layer
Responsible for reliable transfer of messages over a physical link. Messages at this layer are called frames.
Often requires the specification of:
- Framing: delineation of frame start/end, frame size, frame format
- Error control (coding or retransmission)
- Error-correction coding sometimes regarded as PHY functionality
- Medium access control
- Allocating the right to send on channels with multiple senders
- Flow control: avoiding overwhelming a slow receiver
Network layer
Provides link technology-independent abstraction of the entire network to other layers. Handles addressing and routing and end-to-end delivery of messages.
Messages at this layer and above are called packets.
Often requires the specification of:
- Addressing format
- Exchange of routing information and route computation
- Depending on technology: establishment, maintenance and teardown of connections
Transport Layer
Provides reliable, in-sequence and transparent end-to-end transfer; gives programming abstractions to higher layers.
Often requires the specification of:
- Error control procedures (again)
- Link layer protects against transmission error; the router can still lose the packet (e.g. if it has no resources available)
- Flow control procedures (again)
- Link layer does flow control for stations one hop apart; transport layer does end-to-end flow control
- Congestion control procedures (sometimes considered a network-layer issue)
- Protect network against overloading
Session & Presentation Layers
No one really knows what these are.
Session layer:
- Establishes communication sessions between applications
- Can involve several transport layer connections between the same two endpoints to increase data transfer rates. They may run either sequentially or in parallel
- May control the way two partners interact (e.g. alternatingly)
Presentation layer:
- Translation between different representations of data types (e.g. endianness conversion)
Application Layer
Application support functions (high-level APIs) used in many applications. Applications sit on top of this layer.
Examples include file transfer services, directory services, transaction processing support (e.g. two-phase commit)
TCP/IP Reference Model
Broadly equivalent to the OSI model, but with the session and presentation layers removed.
- Physical Layer
- Network Interface
- Internet
- Transport Layer
- Application
Once again, routers only implement the lowest three layers, which are kept simple. This allows many things to be implemented in hardware, allowing routers to keep up with the deluge of packets.
Application Layer
Protocols such as SMTP, HTTP, FTP.
Transport Layer
- Provides end-to-end communications to applications
- Offers its services through the socket interface (TCP, UDP, etc.)
- SAPs are called ports; used for application multiplexing:
- Several applications/processes can use the transport service
- Port bound to one application
- PDUs generated by TCP/UDP called segments
- TCP has mechanisms for:
- Error control (retransmission-based) and in-order delivery
- Flow control
- Congestion control
- Packet loss used as sign of congestion; sender will reduce rate
Internet Layer
IP:
- PDUs are called datagrams
- All higher-layer segments are encapsulated in datagrams
- Specifies an addressing scheme
- Provides end-to-end delivery of datagrams (forwarding)
- Does not specify how routing is done; left to dedicated protocols
- Has no mechanisms for error, flow or congestion control
- Simple, so IP datagrams can be sent over any network interface
- IP over everything
Physical and Network Interface Layer
Network interface layer:
- Similar to link layer
- Accepts IP datagrams and delivers them over a physical link
Service Providers and Service Users
An N-entity acts as a service provider for the layer above it; the (N+1)-entity using the service is the service user.
The service user and provider:
- Talk to each other through service primitives
- Must obey rules in the usage of services (e.g. open file before reading from it)
There are standard service primitives for a service.
Confirmed Service
- Host A's service user issues an S.request service primitive to the service provider (possibly with some user data)
- The service provider at host A generates one or more PDUs (packets) and sends them to host B
- The service provider at host B receives the PDU and generates an S.indication primitive to its service user
- The service user processes the request and prepares a response, passing it to the service provider through an S.response primitive
- The service provider generates a response PDU, sending it to host A's service provider
- Host A's service provider issues an S.confirm primitive to its service user
Unconfirmed Service
The first three steps: S.request, PDU sent to B, then S.indication. Service user A does not know if their request has reached B’s service user.
Confirmed Delivery Service
Host B's service provider simultaneously sends an S.indication to its service user and an acknowledgement packet to A's service provider, which then generates an S.confirm primitive to its service user.
Thus, A’s service user knows that B’s service provider received the request, but not if the service user successfully processed it.
Multiplexing
- Allows several higher-layer entities/connections to share a single lower-layer SAP
- To do this, the PDUs must carry information (e.g. a port number) identifying which higher-layer entity they belong to, so the receiving entity can demultiplex them
- The sending entity must make scheduling decisions about which SDU to transmit next
Splitting
- Allows a single protocol entity to transmit data over several (lower-layer) SAPs
- Thus, the protocol entity must make scheduling decisions to decide which SAP to use for a given PDU
- Mechanisms for sequencing may be necessary, since PDUs sent over different SAPs may arrive out of order
Fragmentation and Reassembly
- PDUs may have a limited size: on lower layers, this is usually for physical reasons (e.g. Ethernet)
- To make PDU sizes transparent to higher layers, the SAP may accept large packets; the protocol entity then partitions the data into several PDUs, called fragments (each with its own header/trailer), and transmits them separately
- The fragments must be numbered so the receiver can reassemble them
- What happens if fragments are lost? IP does not have retransmission, so after some time it must conclude that a fragment will not arrive. Then it can free the reassembly buffer (the buffer needs enough space for all the fragments, even those that have not yet arrived). Otherwise, a malicious actor could cause a DoS by sending incomplete fragment sets
Blocking and Deblocking
- Some higher layers produce very small PDUs (i.e. very small SDUs for the layer below)
- Thus, the transmitter may wait until several SDUs are present, blocking the requests, before putting them into a single PDU to save overhead
- The receiver entity must deblock the PDU into the individual SDUs using the markers placed into the PDU by the transmitter
- When should the sender stop collecting SDUs and start sending? (a trade-off between overhead and delay)
Sequence Numbers
- An N-entity can maintain a sequence number
- For each new PDU, the sequence number is written into the PDU, then incremented
- Allows the receiver to:
- Detect duplicate PDUs
- Detect lost PDUs (and possibly request retransmission)
- Put PDUs back into the right order when reordered by the network
- Implementation issues:
- Overflow/wraparound of the counter (see the comparison sketch below)
- Choice of initial sequence number
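A minimal sketch of wraparound-safe ("serial number") comparison for a 16-bit sequence number - one common way to handle the overflow issue (assumes the two numbers are never more than half the sequence space apart):

```c
#include <stdint.h>
#include <stdio.h>

/* Returns non-zero if sequence number a is "after" b, assuming the two
 * numbers are never more than half the sequence space apart. */
static int seq_after(uint16_t a, uint16_t b) {
    return (int16_t)(a - b) > 0;   // subtraction wraps correctly modulo 2^16
}

int main(void) {
    printf("%d\n", seq_after(2, 65535));   // 1: 2 comes "after" 65535 (wrapped)
    printf("%d\n", seq_after(100, 200));   // 0: 100 is not after 200
    return 0;
}
```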
04. Physical Layering
Fundamentals
Digital Data
Digital data is a finite or countably infinite sequence of discrete symbols, each belonging to a finite set called the alphabet.
As any finite alphabet can be represented by a group of bits, any digital data can be represented as a sequence of bits.
Analog Data
Analog data can take on an uncountable number of values which change continuously over time.
Transmission of Analog and Digital Data
With digital data, we want to recover the transmitted sequence of bits exactly, regardless of any distortions or errors introduced by the channel.
When the receiver receives a group of bits, it must choose one out of a finite number of alternatives. The design of the system centers around maximizing the probability that it makes the correct choice.
With analog data, if the channel introduces distortions, the receiver cannot exactly recover the transmitted data, as it is faced with an uncountably infinite number of alternatives. Thus, the goal of an analog system is to extract the best approximation to the transmitted signal.
Transmission of Digital Data Using Analog Signals
- Most transmission media only allow transmission of an analog physical signal taking on a continuous range of values over a continuous time interval
- Signal/waveform denotes the evolution of some physical quantity over time
- Modulation is when this signal is modified according to the data
Structure of Communication Systems
Formatting
This step transforms a source signal into digital data (i.e. ADC).
Source Coding (Compression)
The step of encoding digital data with the goal of reducing size. There are two types, lossless and lossy coding.
Lossless coding, also called redundancy reduction, encodes the data such that it can be perfectly restored.
Lossy coding, also called relevancy reduction, is used where humans have tolerance to imperfections, such as video/audio. The decoded data may differ from the original data, but aims to do so while reducing the difference perceived by humans.
Channel Coding
This adds redundant symbols/bits to combat transmission errors in the channel, giving channel decoder a better chance to guess the correct data. Thus, it reduces bit-error probability.
Modulator
Takes the sequence of digital symbols and maps these to a physical signal or waveform of a certain duration (symbol time) - this determines the bit rate.
The amount of data sent per symbol time could be increased, but this places the signal alternatives closer together and thus makes them more susceptible to noise and interference. To combat this, the transmit power can be increased.
Demodulator and Channel Decoder
The demodulator must produce a guess of what could have been the transmitted waveform (and thus symbol) - this is possible as there is only a finite number of waveforms.
The probability of error, or symbol/bit error rate, is a key performance metric.
The channel decoder uses the redundancy introduced by the channel encoder to correct some or all of the errors made by the demodulator.
In the case of soft demodulation, the demodulator also sends its confidence that a certain symbol has been interpreted correctly.
Source Decoder and Formatter
The source decoder decompresses the compressed source data.
The formatter converts the digital data back into the form required by the user (e.g. DAC).
Baseband Transmission
NRZ (Non-Return-to-Zero) Encoding
A voltage level is associated with each bit value and transmitted for a fixed amount of time - the symbol time. The inverse is the symbol/baud rate.
As one bit is transmitted per symbol time, the symbol rate is the bit rate.
Bipolar NRZ maps one bit value to +V and the other to -V.
However, more than two levels can be used - a group of k bits can be mapped to one of 2^k signal levels, so each symbol carries k bits.
Signal Impairments
Assume unipolar NRZ is used below.
Attenuation
Attenuation - loss of signal power; in communications, this is often the loss of signal power for varying distance (path loss).
It is measured as the ratio of signal power at the transmitter output, P_tx, to the signal power at the receiver input, P_rx.
This is often expressed using decibels: A_dB = 10 * log10(P_tx / P_rx).
If the medium attenuates all frequencies equally, attenuation is a simple division by the attenuation factor.
Unfortunately, the medium does not attenuate all frequencies the same way.
Assuming there is a cut-off frequency above which all frequencies are attenuated completely, the signal degrades as the cut-off frequency is lowered (think of the Fourier series describing a square wave: only the low-frequency sinusoids remain, making the edges less 'sharp').
A more realistic scenario is a low-pass filter.
Thermal Noise
A standard model for thermal noise is additive white Gaussian noise (AWGN), where the noise signal n(t) is a white Gaussian random process.
The observed signal will be r(t) = s(t) + n(t), the transmitted signal plus the noise.
The noise is usually dependent on temperature.
Passband Transmission
To transmit several data signals simultaneously on a single channel (while avoiding collisions), passband transmission is required.
Data is transmitted by modulating a sinusoidal carrier of a certain frequency; one signal occupies a certain bandwidth around the center frequency.
By choosing center frequencies with sufficient separation, several data signals can be transmitted in parallel in a sufficiently wide frequency band (e.g. FM radio, Wi-Fi bands). This is also called frequency-division multiplexing.
In digital passband modulation, information is modulated onto a sinusoidal carrier of duration T (the symbol time): s(t) = A(t) * cos(2*pi*f(t)*t + phi(t)),
where A(t) is the amplitude, f(t) the frequency and phi(t) the phase - each may change from symbol to symbol.
As each symbol will have their own functions, there may be sudden jumps (in theory) in amplitude as it the waveform moves from one symbol to the next.
There are three forms of digital modulation, each of which varies one of the functions:
Amplitude Modulation (Amplitude Shift Keying) varies the amplitude A(t) according to the symbol.
Frequency Modulation (Frequency Shift Keying) varies the frequency f(t) according to the symbol.
Phase Modulation (Phase Shift Keying) varies the phase phi(t) according to the symbol.
Quadrature Amplitude Modulation (QAM)
A combination of ASK and PSK, varying amplitude and phase according to the symbol simultaneously.
(Constellation diagram: amplitude/phase plane with a square number of constellation points, equally spaced in a grid.)
QAM usually applied with 16, 64 or 256 different constellation points, allowing 4, 6 or 8 bits to be mapped to one QAM symbol. Greater throughput but higher error rate.
Synchronization
Sender and receiver need to agree on how long the symbol period is, the exact center frequency, where packets start and end, etc.
They have no common time or frequency reference; each has its own local time and frequency reference (derived from a crystal oscillator).
Oscillator frequency depends on manufacturing tolerances, ageing, temperature, pressure etc.
Hence, the receiver cannot assume the sender has the same time and frequency reference. Hence the receiver must:
- Synchronize its time/frequency reference with the transmitter’s
- Track and maintain it during packet reception
Oscillator mismatch and Doppler shift can mess up the frequency, and PSK/QAM require tracking signal phase. Synchronizing frequency and phase is called carrier synchronization.
- The transmitter transmits symbols at its own pace, using its local clock
- The receiver needs to extract information about symbol duration so it knows when symbols start and end
- For longer streams, the receiver clock must be synchronized continuously.
- In NRZ or on-off keying, long runs of the same bit mean long durations with the same voltage/amplitude, causing loss of synchronization
- Thus, the signal level must change sufficiently often
Manchester Encoding ensures there is at least one signal change per bit: a logical 1 is represented by a change from high to low in the middle of a symbol duration, and a logical 0 from a change from low to high. Voltage goes from -V to V.
The receiver will need to sample the incoming signal at a much higher frequency; if the transition occurs in an unexpected sample, the local clock can be adjusted so that that sample will be exactly in the middle of the symbol time.
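A bit-level sketch of the encoder, following the convention stated above (1 = high then low, 0 = low then high); each input bit becomes two half-symbols, guaranteeing a mid-symbol transition:

```c
#include <stdio.h>
#include <stdint.h>

/* Encode one byte (MSB first) into 16 half-symbols: 1 -> high,low; 0 -> low,high. */
static void manchester_encode(uint8_t byte, int out[16]) {
    for (int i = 0; i < 8; i++) {
        int bit = (byte >> (7 - i)) & 1;
        out[2 * i]     = bit ? 1 : 0;   // first half of the symbol
        out[2 * i + 1] = bit ? 0 : 1;   // second half: guaranteed transition
    }
}

int main(void) {
    int halves[16];
    manchester_encode(0xA5, halves);    // bit pattern 1010 0101
    for (int i = 0; i < 16; i++) printf("%d", halves[i]);   // prints 1001100101100110
    printf("\n");
    return 0;
}
```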
Frame Synchronization
Ethernet has a preamble, called SYNC, to allow for initial synchronization: a 7-byte sequence of alternating 1s and 0s, followed by a one-byte start-frame delimiter (SFD) - 64 bits in total. The SFD is required because, when the receiver acquires synchronization, it has no idea where in the preamble it is.
How does the receiver know when the frame ends?
- Time gap: medium kept idle between frames
- Code violation: deliberate violations in encoding method (e.g. bit with no transition for Manchester encoding)
- Length field
Ethernet has length fields, but a bit error could occur in the length field so it cannot be trusted.
Bit stuffing is another option: a protocol has some well-known start and end pattern; the pattern must not be allowed to appear in the payload data.
For example, in HDLC, the flag is 01111110; after a sequence of 5 consecutive 1’s, the sender will insert a zero bit. Thus, if the receiver gets six 1’s, it must be the flag; if it gets 5 1’s followed by a 0, it should remove the 0. Unfortunately, a bit error could occur.
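A sender-side sketch of the stuffing rule described above (flags are assumed to be added separately, before and after the stuffed bits):

```c
#include <stdio.h>

/* Bit-stuff an input bit string: after five consecutive 1s, insert a 0.
 * in/out are arrays of 0/1 ints; returns the stuffed length. */
static int bit_stuff(const int *in, int n, int *out) {
    int ones = 0, m = 0;
    for (int i = 0; i < n; i++) {
        out[m++] = in[i];
        ones = in[i] ? ones + 1 : 0;
        if (ones == 5) {        // five 1s in a row: stuff a 0 so the flag never appears
            out[m++] = 0;
            ones = 0;
        }
    }
    return m;
}

int main(void) {
    int in[] = {0, 1, 1, 1, 1, 1, 1, 0};   // contains six 1s in a row
    int out[16];
    int m = bit_stuff(in, 8, out);
    for (int i = 0; i < m; i++) printf("%d", out[i]);   // prints 011111010 - the run is broken up
    printf("\n");
    return 0;
}
```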
05. Local Area Networks and Ethernet
Misc.
Geometric RV:
- Type 1 (number of failures before the first success in Bernoulli trials): P(X = k) = (1 - p)^k * p, with expected value (1 - p)/p
- Type 2 (number of trials up to and including the first success): P(X = k) = (1 - p)^(k-1) * p, with expected value 1/p
LANs
- Packet-switched networks
- Limited geographic extension; usually less than a kilometer
- Usually uses a shared transmission medium to multiple stations
- Usually controlled by one particular entity
- Often higher data rates than in WAN
- Broadcast transmissions possible
Standards
Standards typically specify these layers:
- Physical layer (PHY)
- Transmission media
- Network adapters
- Modulation schemes
- Data rates
- Network topologies
- Medium Access Control (MAC) sub-layer
- Specifies how stations share the transmission medium
- Possibly the link layer; error control and flow control (error control coding often done in hardware in the PHY layer)
IEEE 802 Series
IEEE = Institute of Electrical and Electronics Engineers. ISO has adopted several IEEE standards, including:
- 802.1: Overview, bridging, network management
- 802.2 Logical Link Control
- 802.3: Ethernet
- 802.11: WLAN
- 802.15: WPAN including Bluetooth
802.2 and 802.1’s MAC bridging are layer 2 protocols and thus, they are independent of the layer 1 protocols being used (WLAN, MAN etc.).
Physical Topologies
Bus Topology
Rarely used today. All stations are attached via a tap line to the bus. Thus, signals are heard by all stations - it is a broadcast medium, so frames include a destination address. Parallel transmissions may collide, garbling packets.
Due to signal attenuation, the bus length is limited to a few hundred meters at most.
In addition, if the bus cable is damaged, the entire network will stop working.
Star Topology
Stations attached to a central unit via their own private cable. The cable often has separate cables for transmit and receive directions, allowing full-duplex.
The central unit can act as a repeater or bridge/switch. As a repeater, it copies incoming signals to all other outputs, which may congest the network. As a bridge/switch, it copies incoming frames to only where the destination is. The latter allows transmissions to occur in parallel.
There is still a single point of failure (the central unit), but the impact of a cut cable is much smaller.
Wireless Topologies
Wireless transmission media are much more prone to errors, and stations cannot necessarily hear all stations - the channel is time-varying and the links are volatile and unpredictable.
MAC Fundamentals
These schemes allow a set of distributed stations to efficiently coordinate access to a common channel. More specifically:
- There are a number of stations wishing to communicate
- There is a shared communications channel/resource that can be used by one station at a time
- No other means for information exchange between stations exists
It is often regarded as a separate sub-layer between the PHY and link-layer.
Assumptions:
- The shared channel is a broadcast medium - this may not be true for wireless transmission media
- If parallel transmissions occur, all contending transmissions are garbled - once again, this may not be true for wireless transmission media
Desirable properties:
- Small medium access delay - the time between the arrival of a packet and the start of a successful transmission
- Collisions, waiting times etc. affect this
- Small access delay desirable if medium is lightly loaded
- Fairness and fair re-use of unused resources
- Low overhead and high throughput
- Stability - increasing load should not decrease throughput
Orthogonal MAC Schemes
The behavior of one station does not influence the behavior of other stations. This is done by exclusively allocating resources to individual stations.
FDMA: frequency division multiple access; TDMA: time division; SDMA: space division; CDMA: code division.
FDMA
Each station is given a band of frequencies - a sub-channel they have exclusive transmission access to.
The channel bandwidth is subdivided into N sub-channels, with guard bands acting as safety margins between sub-channels and the fringe of the channel. This reduces interference between adjacent sub-channels.
To receive data, a station must either have a separate receiver for each sub-channel or a single tunable receiver - the latter has a coordination problem, as the receiver must tune (which takes time) to the right sub-channel before data is transmitted on it.
Performance
If the total available bandwidth is B (bps), station i is assigned 1/N of B (approximately - guard bands ignored) on a long-term basis.
Medium access delay is 0 as i can start the transmission immediately without risk of collision.
If a packet has size B/N bits, transmission takes one second. Thus, with zero access delay, the packet completes exactly one second after it arrives.
Advantages and Disadvantages
N stations can transmit in parallel, and there is no need for time synchronization between the transmitters.
However, each station either needs N receivers or tuneable receivers, frequency synchronization is needed, and there is no re-use.
In addition, there needs to be some way of allocating sub-channels; if there is a central station in charge of keeping track of which channels are in use and processing allocation requests, what happens if it crashes?
Hence, FDMA is good for CBR but bad for VBR.
TDMA
Time is subdivided into successive super-frames of duration T_SF, with each super-frame being subdivided into N subframes (plus guard times to compensate for synchronization errors). Hence, time slots are assigned exclusively, and on a longer-term basis, to a station i for transmission.
Performance
Each station gets the full bandwidth B (bps) for 1/N of the time (ignoring guard times).
If:
- Station i owns one time slot
- T_SF is 1 second
- A time slot is long enough to carry B/N bits (slot duration T_SF/N at the full rate B)
- A packet of B/N bits arrives at a random time to station i
Thus: station i must wait, on average, half a super-frame for its slot before transmitting at the full bandwidth B.
Thus, on average, TDMA allows starting later but finishing sooner than FDMA, due to the full bandwidth being exclusively available during the slot - see the worked comparison below.
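A rough worked comparison under these assumptions (packet of B/N bits arriving at a uniformly random time; guard bands/times ignored):

$$
T_{\mathrm{FDMA}} = 0 + \frac{B/N \text{ bits}}{B/N \text{ bps}} = 1\ \mathrm{s},
\qquad
E[T_{\mathrm{TDMA}}] = \frac{T_{SF}}{2} + \frac{T_{SF}}{N} = 0.5 + \frac{1}{N}\ \mathrm{s}
$$

For N > 2 the TDMA average completion time is below FDMA's 1 s, matching the claim above.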
Advantages and Disadvantages
Compared to FDMA, asymmetric bandwidth assignment is much easier - one station can simply be given multiple time slots, rather than needing multiple receivers running simultaneously. In addition, TDMA has better completion times on average, and does not require tunable or multiple receivers.
The downsides are that tight time synchronization between stations is required and there is a high expected access delay even on idle systems.
In addition, TDMA also requires signalling and shared state to allocate time slots and so has the same issues.
Hence, TDMA is also good for CBR but bad for VBR traffic.
Random Access Protocols
These protocols do not reserve channel resources or require a central station/shared state. Access to the medium is unpredictable, using randomness to avoid collisions.
ALOHA
N uncoordinated transmitters and one receiver (a base station). A collision occurs if two packets overlap at the receiver.
When a new packet arrives, it is put in a queue. Once the current packet is transmitted, it checks the queue and transmits the packet after a random backoff.
If the queue is empty, it calculates the checksum and transmits the frame immediately, then starts an acknowledgement timer.
The receiver sends an immediate ack on successful reception of a packet; if the ack timer expires, the transmitter enters backoff mode:
- The transmitter chooses a random backoff time
- When the backoff timer expires, the transmitter retransmits the frame
- If the number of failed trials exceeds the threshold, the frame is dropped.
The choice of random distribution for backoff times plays a key role in controlling the delay, throughput and stability. Often, the distribution depends on the number of collisions seen by the frame.
ALOHA is simple to implement and if the network load is small, new frames are sent immediately - zero access delay, with the full channel bandwidth.
However, if A and B send packets of the same length, there is a vulnerability period two frame times long during which a transmission by B will collide with A's - starting one frame time before A begins and lasting until A finishes.
ALOHA has good throughput for low loads, but beyond a certain point, the increasing collision rate leads to decreased throughput - it is an unstable protocol.
If all packets have the same length and the packet transmission time is τ, the vulnerable period around each transmission is therefore 2τ.
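Under the usual textbook assumption of Poisson packet arrivals with total offered load G (transmission attempts per packet time τ), this gives the classic pure-ALOHA throughput:

$$
S = G e^{-2G}, \qquad S_{\max} = \frac{1}{2e} \approx 0.18 \quad \text{at } G = \tfrac{1}{2}
$$

which is why throughput collapses once the offered load grows past G = 0.5.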
CSMA - Carrier Sense Multiple Access
These protocols assume that stations can determine the state of the medium (busy or idle) (almost) instantaneously.
This is called carrier-sensing or clear channel assessment, and allows for listen-before-talk: before transmitting a frame, it performs the CS operation. If the channel is busy, the station defers the transmission using some strategy. It usually has a maximum number of deferrals/backoffs for a frame. Note that this does not eliminate collisions completely - collisions can occur if the time difference between two stations starting a transmission is smaller than the propagation delay.
This works well when the propagation time is small compared to the packet transmission time, as other stations notice a transmission quickly after it starts. Thus, it works well in local area networks, but is useless if the sender may already have finished transmitting by the time a distant station detects the start of the frame.
Non-persistent CSMA
After getting a new packet from the higher layer (with an empty queue), the station performs the carrier-sense operation. If the medium is idle, it can transmit immediately but if the medium is busy, it generates a backoff time from a random distribution. Then it can repeat the process until it succeeds or reaches some threshold.
In the case of a collision, a backoff time is chosen and the process repeats. How can collisions be detected? One solution is to expect an acknowledgement for each frame.
Performance issue: there is a high probability that the medium will be idle for some time after the previous transmission finishes (especially if the packet is long), lowering utilization.
If it is known there is a small number of transmitters, using a distribution returning small waiting time reduces the time gaps between packets, increasing throughputs.
p-persistent CSMA
A parameter p is known to all stations.
If the medium is busy, it defers until the end of the ongoing transmission.
It will then divide time into slots, which should be just large enough to run the following:
- With probability p it starts its transmission; otherwise, it defers for a further time slot
- If it defers, it checks the medium, waiting until the end of the transmission if a new one starts
- If a collision occurs, the process starts over
Choosing p is an issue, with a low value leading to stability in high load but low throughput as the medium will be idle after a transmission ends.
1-persistent CSMA or CSMA/CD
If the medium is busy, it defers until the end of the transmission.
When the medium becomes idle, the station sends unconditionally. This avoids idle times between transmissions.
However, if multiple stations start transmitting, a collision occurs and the transmitters detect this immediately - they need to be able to transmit and sense simultaneously. In this case:
- The transmission is aborted
- A jamming signal is sent to inform all stations about the collision
- A collision resolution procedure is started (e.g. backoff)
Ethernet
Frame Format
| Length (bytes) | Name        |
| -------------- | ----------- |
| 7              | Preamble    |
| 1              | SOF         |
| 6              | DstAddr     |
| 6              | SrcAddr     |
| 2              | Length/Type |
| 46-1500        | Payload     |
| 4              | FCS         |
Uses Manchester encoding.
- Preamble: for bit synchronization
- Start of Frame: end of preamble, also for synchronization
- Destination address
- Source address
- Length/Type field
- Payload
- Minimum length due to CSMA protocol
- If the higher layer hands down a payload that is too small, the payload is padded: how does the receiver know which bits are padding? It can't; it must pass everything up to the higher layer
- The IP protocol has its own length information embedded in the header, so it uses that to discard the padding
- Frame Check Sequence: CRC checksum
Ethernet Addresses
A 48 bit, globally unique address. Each vendor gets its own address range and each Ethernet adapter has its own address.
Addresses are written as six colon-separated bytes in hexadecimal representation.
Special addresses:
- The broadcast address is
FF:FF:FF:FF:FF:FF - Addresses where the first bit is
1are multicast addresses (except the broadcast address) - Addresses where the first bit is
0are unicast addresses
Address filtering: a station will throw away any unicast addresses not addressed to itself.
Length/Type field
When L/T is bigger than or equal to 0x0600, it is a type field: it indicates the higher-layer protocol encapsulated in the Ethernet frame. Hence, it allows protocol multiplexing - a service to multiple higher-level protocols can be provided using a single lower-level SAP. Some common protocols:
| Type field | Protocol        |
| ---------- | --------------- |
| 0x0800     | IPv4            |
| 0x0806     | ARP             |
| 0x86DD     | IPv6            |
| 0x8863     | PPPoE Discovery |
| 0x8864     | PPPoE Session   |
When L/T is less than or equal to 1500, it indicates the length of the payload; the first two bytes of the payload then indicate the type (hopefully).
PHY
There are several different physical layers such as coaxial cables, twisted pair cables, and fibre cables.
Ethernet also supports two fundamentally different topologies: broadcast (almost never used today) and switched Ethernet.
Broadcast Topologies
- All stations share the medium and hence, MAC is needed
- Stations operate in half-duplex mode - they cannot simultaneously send and receive
- In Ethernet, this can be achieved using the bus topology with coaxial cable or with hub-based topology
- Called Half-Duplex Mode
Switched Topologies
Stations are attached to (possibly cascaded) switches via point-to-point links - a private cable. Hence, no MAC is needed (but is still used) and parallel transmissions are possible, although there may be contention for resources in the switch.
Stations operate in full-duplex mode - each link uses pairs of cables.
This is required for 100 Mbps and higher PHYs - basically all modern Ethernet.
Half-Duplex Ethernet MAC Protocol
CSMA/CD - Carrier-Sense-Multiple-Access with Collision-Detection.
This has very short medium access delays under light load, but the collision rate increases under heavy load with many stations - do not operate Ethernet beyond 50-60% load.
Assumes that parallel transmissions lead to collisions and that stations can monitor the medium for collision while transmitting - this works by checking the voltage on the medium and checking if this exceeds the voltage applied by the transmitter.
- A new packet arrives at the MAC. The local collision counter coll is set to 0
- Carrier-sense operation performed
- If idle, the transmission starts immediately
- If busy, listen until the channel becomes idle, then start the transmission immediately
- While transmitting, check for a collision. If one is detected:
- Abort the frame transmission and send a short jamming signal
- Increment coll. If coll > 16, drop the frame and set coll to 0
- Wait for a random time (the backoff time) and go back to the carrier-sense step
- If no collision is detected, transmit the whole frame and reset coll
Backoff time computation
The computation uses a truncated binary exponential backoff algorithm.
A random integer k is generated uniformly in the range [0, 2^min(coll, 10) - 1]; the backoff time is k slot times.
The slot time is a common parameter that is just large enough to cover the maximum round-trip time and processing delays such that if two stations start transmitting one slot apart, the later one will notice.
The algorithm starts by assuming the number of contenders is small - if this is the case, only a few backoff slots are needed. If transmission keeps failing, it is likely that there are many contenders, so the window (number of slots) is increased to reduce the probability of another collision (see the sketch below).
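A sketch of this backoff computation under the stated assumptions (10 Mbps slot time of 51.2 µs; rand() used for brevity where a real adapter has its own randomness source):

```c
#include <stdlib.h>
#include <stdio.h>

#define SLOT_TIME_US 51.2   /* slot time for 10 Mbps Ethernet */

/* Truncated binary exponential backoff: after the coll-th collision, wait
 * k slot times with k drawn uniformly from {0, ..., 2^min(coll,10) - 1}. */
static double backoff_us(int coll) {
    int k_exp  = coll < 10 ? coll : 10;    // truncate the exponent at 10
    int window = 1 << k_exp;               // 2^k_exp possible slots
    int k      = rand() % window;          // uniform in [0, window - 1]
    return k * SLOT_TIME_US;
}

int main(void) {
    for (int coll = 1; coll <= 16; coll++)
        printf("collision %2d: backoff %.1f us\n", coll, backoff_us(coll));
    return 0;
}
```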
Minimum Frame Size/Slot Time
The minimum frame size (64 bytes for 10 Mbps) is chosen to ensure the transmitter can detect collisions - it can only detect collisions while it is still transmitting.
NB: the preamble and SOF are not included in the frame size calculation.
The maximum cable length and processing time from repeaters/hub is defined in the standard, so it is possible to generate the worst case:
- Station A starts transmitting
- Just before it reaches station B, station B starts transmitting
- Station A can only detect the collision while it is still transmitting
- Hence, the worst-case detection delay is twice the propagation delay, plus some buffer for processing time (see the worked example below)
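A worked example with the classic 10 Mbps parameters: the 64-byte minimum frame takes

$$
\frac{64 \times 8\ \text{bits}}{10\ \text{Mbit/s}} = 51.2\ \mu\text{s}
$$

to transmit (the slot time), which must be at least the worst-case round-trip propagation delay plus repeater/processing margins, so any collision on the segment is detected before the transmitter finishes sending.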
Bridges and Switches
Often, existing LANs need to be coupled together - this can be done at several different levels.
| Layer | Name of device |
|---|---|
| PHY | Repeater/Hub |
| MAC | Bridge/Layer-2 switches |
| Network | Router |
| Application | Gateway |
Marketing departments do not necessarily follow this terminology.
Repeaters and Hubs
Repeater
Repeaters are dumb devices which simply amplify a signal on the analog level. They amplify existing noise and add their own noise to the signal, but have very small delays and are completely protocol or modulation-agnostic.
Regenerating Repeater
These demodulate an incoming signal on a symbol-by-symbol basis and modulate it again.
They do no error checking so regeneration can introduce errors.
Hub
Hubs are centralized repeaters - they broadcast signals coming in on one port to all other ports. Only one station can transmit at a time.
Hubs may or may not be regenerative.
Basically a bus, except in a star topology, so a single broken cable won’t kill the entire network.
Bridges
Bridges interconnect LANs on the MAC layer. They mostly connect LANs of the same type (e.g. Ethernet), or at least ones with the same MAC address structure. Bridges can be cascaded.
Bridges understand/interpret fields related to the MAC protocol (source/address fields); repeaters and hubs do not.
Basic Operation
- Packet transmitted on LAN A
- All stations on LAN A, plus the bridge, receive the packet
- If the bridge knows the destination address belongs to a station in LAN A, it does nothing
- The bridge learns which stations are attached to which LANs over time by reading the source addresses of the packets transmitted on each LAN.
- Otherwise, it transmits it to all other LANs it is connected to
- This is dangerous when several bridges are used and loops are present
Advantages
- Reliability: failures in one LAN do not affect others as they are only interconnected by the bridge
- Performance: allows several parallel transmissions as long as the packets do not cross LANs
- Security: local traffic is not transmitted to other LANs
Layer-2 Switches
Stations are attached to the switch via point-to-point/private links with separate transmit/receive line (full-duplex). Hence, there is no broadcast medium so no need for a MAC. Instead, stations contend for the resources of the switch.
Switches transmit frames only to the correct output port, and can process frames in parallel - hence, they can increase network capacity as compared to LANs with a broadcast medium.
Frames arriving in parallel to the same destination are buffered.
Switches are transparent to stations.
Comparison to hub
- Hub builds a broadcast medium - only one station can transmit at a time
- A switch can accept up to N parallel transmissions, where N is the number of stations in the LAN
- Each attached station can receive packets at full medium capacity
06. IP and Related Protocols
The Internet
A packet-switched network of networks. The networks and links follow the end-to-end principle:
- Keep routers dumb, put intelligence in hosts
- Routers know how to deliver packets
- Making deliveries reliable is done by the end host
IPv4
Terminology:
- IP Packets are called datagrams unless fragmentation/reassembly is being used - they are called fragments in this case
- End stations are called hosts
- IP routers are called routers
- IP addresses are assigned to network interfaces, not hosts
Best-Effort, Datagram Delivery:
- Connectionless: no connection/shared state
- Unacknowledged
- Unreliable: no retransmissions
- Unordered: no in-sequence guarantees
Packet Format
Big endian byte ordering.
| Length (bits) | Name |
|---|---|
| 4 | Version |
| 4 | Hdr Len |
| 6 | TOS/DSCP |
| 2 | Unused |
| 16 | Total Length |
| | (bytes 0-3) |
| 16 | Identification |
| 3 | Flags |
| 13 | Fragment Offset |
| | (bytes 4-7) |
| 8 | Time-To-Live |
| 8 | Protocol Type |
| 16 | Header Checksum |
| | (bytes 8-11) |
| 32 | Source Address |
| | (bytes 12-15) |
| 32 | Destination Address |
| | (bytes 16+) |
| 32n | Options + Padding |
| variable | Data |
Version (4)
Version of IP protocol. 4 for IPv4.
Header Length (4)
Length of header in multiples of 4 bytes. Without options, the header is 20 bytes long so the value would be 5.
TOS/DSCP (6)
Stands for Type of Service/DiffServ Code Point. Allows the priority for each packet to be defined (usually by the ISP depending on how much you pay) - mostly ignored.
Fill with zero.
Unused (2)
Fill with zero.
Total Length (16)
Total length (in bytes) of the datagram - required since some MAC layers may add padding to meet minimum length requirements.
This may be modified during fragmentation and reassembly.
Identification (16)
Identifies each IP payload accepted from higher layers for a given interface - a sequence number.
Flags (3)
Contains two flags for fragmentation and reassembly (DF = don’t fragment; MF = more fragments).
Fragmentation Offset (13)
Gives the offset of the current fragment within the entire datagram, in multiples of 8 bytes.
Time-to-Live (8)
Last resort to deal with loops - IP does not specify routing protocols and so there is a possibility that loops can occur.
This value is the upper limit on the number of routers a packet can traverse, and is decremented by each router the packet passes through (hence routers must recalculate the checksum). Once the TTL value reaches 0, the packet is discarded and the sender is notified via an ICMP message.
TTL is typically 32 or 64.
Protocol Type (8)
Determines which higher-level protocol generated the payload - provides different SAPs to allow protocol multiplexing.
| Value | Protocol |
|---|---|
| 0x01 | ICMP |
| 0x02 | IGMP |
| 0x04 | IP-in-IP Encapsulation |
| 0x06 | TCP |
| 0x11 | UDP |
Header Checksum (16)
Checksum for the IP header (NOT the data) - for some types of data, a few bad bits may not matter.
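For reference, this is the standard Internet checksum: the one's-complement of the one's-complement sum of the header taken as 16-bit words. A sketch, assuming the checksum field is zeroed before computation and the header length is even (IP headers are always a multiple of 4 bytes); the sample header in main is only illustrative:

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Compute the Internet checksum over hdr_len bytes.
 * The Header Checksum field must be set to 0 before calling this. */
static uint16_t ip_checksum(const uint8_t *hdr, size_t hdr_len) {
    uint32_t sum = 0;
    for (size_t i = 0; i + 1 < hdr_len; i += 2)
        sum += (uint32_t)(hdr[i] << 8 | hdr[i + 1]);   // big-endian 16-bit words
    while (sum >> 16)                                   // fold carries back in
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;                              // one's complement of the sum
}

int main(void) {
    uint8_t hdr[20] = {
        0x45, 0x00, 0x00, 0x3c,  0x1c, 0x46, 0x40, 0x00,
        0x40, 0x06, 0x00, 0x00,  0xac, 0x10, 0x0a, 0x63,
        0xac, 0x10, 0x0a, 0x0c                           /* checksum bytes 10-11 zeroed */
    };
    uint16_t c = ip_checksum(hdr, sizeof(hdr));
    printf("checksum = 0x%04x\n", c);                    /* 0xb1e6 for this sample header */
    hdr[10] = c >> 8; hdr[11] = c & 0xFF;                /* write it back into the header */
    printf("re-check = 0x%04x\n", ip_checksum(hdr, sizeof(hdr)));   /* 0x0000 when intact */
    return 0;
}
```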
Source/Destination Address (32)
Indicates the initial sender and final receiver of the datagram. Internally structured into two sections: the network and host ID. The network ID refers to some local network, and the host ID must be unique within that network.
Routing and Forwarding
Routing/Forwarding Tables
IP routers have several network interfaces - these are called ports. When a router receives a packet on some input port, it looks at the DestinationAddress field to determine the output port using the forwarding table.
The forwarding table is a table of all networks the router knows about, identified by their network ID, and the output port to send the packet through to reach that network. A table lookup is used to then select the output port for every incoming packet.
Notes:
- The time of the lookup is dependent on the number of entries. Expensive routers often implement this in hardware
- The table is filled by a separate routing protocol
- It is the responsibility of the last router on a path to deliver an IP datagram to the host.
Important: a host address is tied to its location in the network; there is no mobility support - if a host switches to another network, it obtains another address, breaking ongoing TCP connections.
Classless Inter-Domain Routing
How many bits should be allocated to the network ID? Classful addressing used to be used, allocating exactly 8, 16 or 24 bits to the network ID.
CIDR was introduced in 1993 and is now mandatory, and allows an arbitrary number of bits to be allocated to the network.
Netmask
The netmask specifies which bits belong to the network ID. It is a 32 bit value, with the leftmost k bits being 1s, and the remaining bits 0. A /k netmask will have k bits allocated to the network ID. The network ID is calculated using a logical AND with the IP address.
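As a quick illustration (addresses made up), Python's `ipaddress` module can derive the netmask, network ID and broadcast address from a /k prefix:

```python
import ipaddress

# Hypothetical host address with a /20 prefix (k = 20 bits of network ID)
iface = ipaddress.ip_interface("192.168.37.99/20")

print(iface.netmask)                    # 255.255.240.0  (20 leading 1-bits)
print(iface.network.network_address)    # 192.168.32.0   (address AND netmask)
print(iface.network.broadcast_address)  # 192.168.47.255 (host bits all 1s)
```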
Private host addresses
There are two special host addresses:
- The host address with all 0s (32-k zeroes) signifies that the network as a whole is being referred to
- The host address with all 1s (32-k ones) is the broadcast address - all hosts on the network will receive the packet
There are also some reserved IP blocks:
| Block | Usage |
|---|---|
| 10.0.0.0/8 | Private-use IP networks |
| 127.0.0.0/8 | Host loopback network |
| 169.254.0.0/16 | Link-local for point-to-point links |
| 172.16.0.0/12 | Private-use IP networks |
| 192.168.0.0/16 | Private-use IP networks |
These addresses can be used within the provider network; packets with private addresses are not routed in the public internet.
The loopback address is usually denoted as 127.0.0.1, but any valid address in 127.0.0.0/8 can be used.
Simplified Packet Processing
IP input:
- Run sanity check on packet
- Insert packet into IP input queue
- Process IP options (extremely rare - not discussed)
- Check if destined to its own IP address or the broadcast address; if so:
  - Protocol demultiplexing: check the `Protocol` field, pass the payload to the right higher-layer protocol (TCP/UDP/whatever)
- If not:
  - Check if forwarding is enabled; if not, the device is not a router, so drop the packet
  - Pass to IP output
IP output:
- Check if the packet is an immediate neighbor/directly reachable (e.g. directly via Ethernet)
- If so, attempt to deliver over a single hop
- Else:
- Calculate next hop using forwarding table lookup
- Forwarding table filled and maintained by a routing daemon
  - Decrement `TTL`
  - Recompute `HdrChksum`
  - Hand over packet to outgoing interface
Forwarding Table Contents and Lookup (rough approximation)
Each entry contains:
- Destination IP address: one of
- Full host address (i.e. non-zero host-id)
- Network address with netmask (more common)
- Information about next hop: one of
- IP address of next-hop router (which is directly reachable)
- IP address of directly-connected network (network address/netmask)
- Flags
- If destination IP is host or network
- If next hop is router or directly attached to the network
- Specification of outgoing interface
A router only knows the next hop, not the entire remaining route. The forwarding table lookup has three stages:
- Look for an entry that is a full host address completely matching `dst`. If found, send the packet to the indicated next hop/outgoing interface and stop processing
  - Extremely uncommon
- Look for an entry that is a network address matching `dst`'s network address, i.e. `dst & netmask == net-addr & netmask`. If found, send the packet to the indicated next hop/outgoing interface and stop processing
  - Usually, the network address already has the host bits set to zero, so masking `net-addr` again is slightly overkill
- Look for a special default entry - if found, send the packet to the indicated next hop (the default router) and stop processing
- Drop the packet and send an ICMP message back to the original sender of the datagram
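A rough sketch of this staged lookup, with a hypothetical table and ignoring longest-prefix matching (the first matching network entry wins):

```python
import ipaddress

# Hypothetical forwarding table: host routes, network routes, optional default
host_routes = {ipaddress.ip_address("203.0.113.7"): "eth2"}
net_routes = {ipaddress.ip_network("192.168.0.0/16"): "eth0",
              ipaddress.ip_network("10.0.0.0/8"): "eth1"}
default_route = "eth3"

def lookup(dst):
    dst = ipaddress.ip_address(dst)
    if dst in host_routes:                       # 1. full host match (rare)
        return host_routes[dst]
    for net, port in net_routes.items():         # 2. network match
        if dst in net:                            #    dst & netmask == net-addr
            return port
    if default_route is not None:                 # 3. default entry
        return default_route
    raise ValueError("no route: drop and send ICMP")  # 4. unreachable

print(lookup("10.20.30.40"))   # eth1
```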
In End-Hosts
Forwarding tables in end hosts generally contain two or more entries:
- The default route, typically configured via DHCP
- An entry for each network it is directly connected to - this way, it can communicate with other hosts on the network
In Routers
Most routers at the ‘border’ of the Internet (close to the customer) usually have a small forwarding table, often filled with other networks belonging to the same owner. For all other networks, they rely on the default router.
Core routers:
- Do not have a default router
- Are the default routers of other routers
- Know (almost) all of the Internet networks
Fragmentation and Reassembly
The link-layer technologies used by IP have different maximum packet sizes, called the maximum transmission unit. For several reasons, IP abstracts these concerns:
- For separation of concerns
- A packet could use different link-layer technologies in transit
- The route could change due to link failures/restores
Hence, IP offers its own maximum message length of 65515: (2^{16} - 1) - 20, where 20 is the minimum IP header size.
The overall process is as follows:
- The sender IP instance partitions the message into fragments
- Each fragment is transmitted individually as a full IP packet
- Header information specifies that this is a fragment, and stores the position of the fragment in the message
- Each fragment has a size no larger than the MTU of the outgoing link
- Hence, intermediate routers can fragment a full message or split fragments further, but never re-assemble them (as it is not guaranteed all packets will follow the same path)
- The destination IP instance buffers the received fragments, re-assembles the message, then delivers it to the higher layer
- When the first fragment arrives, a buffer large enough for the whole message is allocated and a timer is started
All fragment datagrams belonging to the same message have:
- A full IP header
- The same value in the `Identification` field (unique identifier set by the higher-layer protocol)
- A `TotalLength` field reflecting the size of the fragment
- A `FragmentOffset` field, specifying the offset (in units of 8 bytes) relative to the original, un-fragmented IP datagram
- All but the last fragment will have the `MF` (more-fragments) bit set
  - The last fragment does not have `MF` set. It instead has a non-zero `FragmentOffset` to differentiate it from an un-fragmented packet
The sender can set the DF (don’t fragment) bit in the IP header to forbid fragmentation by intermediate routers. If the outgoing link’s MTU is too small, the intermediate router may return an ICMP message.
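A small sketch (not the kernel's algorithm) of how fragment sizes and 8-byte offset units could be derived for a given MTU:

```python
IP_HEADER = 20

def fragment(payload_len, mtu):
    """Return (offset_in_8_byte_units, fragment_payload_len, more_fragments) tuples."""
    max_data = (mtu - IP_HEADER) // 8 * 8   # data per fragment must be a multiple of 8
    fragments, offset = [], 0
    while offset < payload_len:
        size = min(max_data, payload_len - offset)
        more = (offset + size) < payload_len
        fragments.append((offset // 8, size, more))
        offset += size
    return fragments

# e.g. a 4000-byte payload over a link with a 1500-byte MTU
print(fragment(4000, 1500))  # [(0, 1480, True), (185, 1480, True), (370, 1040, False)]
```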
Address Resolution Protocol
IP datagrams are encapsulated into Ethernet frames, and Ethernet stations pick up a packet only if the destination MAC address matches its own. Hence, stations must know which MAC address a given IP address refers to.
ARP allows a host to map a given IP address to a MAC address. ARP is dynamic:
- The MAC address for a given IP address does not need to be statically configured
- Nodes can be moved or equipped with new network adapters without any re-configuration
This trades the downsides of static configuration for the complexity of running a separate protocol.
Basic Operation
- Two stations, A and B, are attached to the same Ethernet network
- Each has an IP and MAC address
- Each maintains an ARP cache
- Station A wishes to send an IP packet to station B’s IP address, but does not know B’s MAC address
- Station A broadcasts an ARP-request message containing:
- Its own IP and MAC address
- Station B’s IP address
- Hence, the packet is sent to the Ethernet broadcast address. The length/type field of the Ethernet frame has a value of `0x0806`
- Any other host recognizes that the IP address is not its own, and discards the packet
- If it cached every binding it ever heard, its ARP cache would get full quickly
- When Station B receives this, it:
- Stores a binding between A’s IP address and MAC address in its own cache
- Responds with a unicast ARP-reply packet including:
- Its own MAC and IP address
- A’s MAC and IP address
- A then stores the binding between B’s IP and MAC address in its ARP cache
ARP makes no retransmissions if a request is not answered.
If a sender wants to send an IP packet to a local destination, it first checks its ARP cache for a binding for the IP address. If there is none, it starts the address resolution procedure. Once this has finished, it encapsulates the IP packet in an Ethernet frame directed to the resolved MAC address.
Entries in the ARP cache are soft-state; they get discarded some fixed time (usually 20 minutes) after they are inserted into the cache. Some implementations restart the timer after using an entry.
This ensures that stale information gets flushed out (much more reliable than relying on the device of interest notifying everyone that its binding has changed).
Frame format
- Standard Ethernet header: `DstAddr`, `SrcAddr`, `FrameType`
- `HardType`: type of MAC addresses used
  - `0x0001` for Ethernet’s 48-bit addresses
- `ProtType`: identifier for the higher-layer protocol that requires the address resolution
  - `0x0800` for IPv4
- `HardSize`, `ProtSize`: size in bytes of the hardware and protocol addresses
  - 6 and 4 for Ethernet and IPv4
- `Op`: type of ARP packet; differentiates between an ARP-request and an ARP-reply
- Sender Ethernet/IP address
- Target Ethernet/IP address
ARP is a protocol working at a higher level than Ethernet, so although the destination and source MAC addresses already appear in the Ethernet header, they are included again in the ARP payload.
Internet Control Message Protocol
Allows routers or the destination host to inform the sender of error conditions such as when:
- No route to the destination
- Destination host is not reachable
- Fragmentation required, but `DF` set
ICMP messages are encapsulated into regular IP datagrams and, like IP, have no error control mechanisms.
ICMP messages are not required, and are often filtered out by firewalls even if they are sent. Thus, the sending host must not rely on ICMP messages.
Message Format
- `type`: 8-bit ICMP message type
- `code`: 8-bit ICMP sub-type
- `checksum`: 16-bit checksum covering the header and data (calculated with `checksum` set to zero)
- Data: depends on the `type`/`code` combination, but often includes the first few bytes of the offending IP datagram (including its header)
Some common type/code Combinations
| Type | Code | Meaning |
|---|---|---|
| 0 | 0 | Echo reply (ping) |
| 3 | 0 | Destination network unreachable |
| 3 | 1 | Destination host unreachable |
| 3 | 2 | Destination protocol unreachable |
| 3 | 3 | Destination port unreachable |
| 3 | 4 | Fragmentation required but DF set |
| 3 | 6 | Destination network unknown |
| 3 | 7 | Destination host unknown |
| 4 | 0 | Source-quench |
| 8 | 0 | Echo request (ping) |
| 11 | 0 | TTL expired in transit |
| 11 | 1 | Fragment reassembly time exceeded |
Destination Unreachable (type=3):
- `code=0`: router has no matching entry (or default entry) for the non-directly-connected destination address in its forwarding table
- `code=1`: reached the last router before the destination, but that router could not reach the directly connected host (e.g. no ARP response)
- `code=2`: reached the final destination, but the value of the `protocol` field is not implemented by the receiving host
- `code=3`: used with TCP/UDP when no socket is bound to the port number
Source-quench (type=4, code=0): if IP router drops packet due to congestion. The intention is to let the source host throttle itself, but the act of sending the packet adds more work to the router.
TTL expiration (type=11, code=0): if IP router drops packet due to TTL reaching zero. tracert uses this, incrementing the TTL each time it sends a packet.
Fragmentation reassembly timeout (type=11, code=1): if not all fragments of a message received within a timeout. This invites the higher-level protocol to re-transmit the message.
07. Introduction to Routing
Network Layer Overview
Names of data units:
- Application: message
- Transport: segment
- Network: datagram
- Link: frame
Responsibilities of layers:
- Network layer: responsible for finding a route from A to B
- Transport layer: responsible for transporting data from A to B reliably or easily
- Application layer: responsible for sharing data
Layering: each layer takes data from above, adds header information, then passes the new data unit to the layer below.
Possible services of a general network layer:
- Guaranteed delivery (with/without bounded delay)
- In-order delivery
- Guaranteed minimal bandwidth
- Guaranteed maximum jitter
- Jitter: variation in the gap between two consecutive messages
The Internet network layer, IPv4, guarantees none of these.
Routing Overview
Routing is a network-wide process that determines the end-to-end paths packets take from source to destination.
The network needs to work out the path from the sending to the receiving router automatically in a dynamic environment.
A router uses its routing table to forward a packet to some output link.
The router’s switch fabric - the physical architecture of the router - varies. Some implementations put the network interfaces/line cards close to the CPU to decrease processing times. The crossbar architecture is used to more quickly forward packets to outgoing interfaces.
Hierarchical Overview
There are inter-AS and intra-AS routing algorithms, where AS stands for an autonomous system.
Each ISP is an AS, and some corporations and universities are also ASs.
Each AS is a set of routers and networks managed by a single organization. The routers exchange information via a common routing protocol - the intra-AS routing protocol.
The only inter-AS routing protocol in use is BGP. RIP and OSPF are the two main intra-AS routing protocols. All three are load-insensitive.
There are three types of AS:
- Stub AS
- Has only one connection to another AS
- Like a leaf node
- Multi-homed AS
- Has more than one connection to other ASs, but does not allow traffic to pass through
- e.g. university network will not be doing forwarding
- Transit AS
- Has connections to more than one AS and allows traffic to pass through
- Run by ISPs
Forwarding vs Routing
Forwarding and routing are two independent processes.
Routing determines the path to take:
- Is not done packet-per-packet
- Is non-real time: up to two minutes of latency
- Requires communication between routers to jointly create forwarding tables
Forwarding transfers packets hop-by-hop:
- It determines which outgoing link to use on a packet-by-packet basis
- Is done in real time, possibly in hardware
The forwarding table:
- Maps a destination address to an interface
- Results from either the execution of the routing protocol or is static and preconfigured
- Is consulted for every packet
- Changes on relatively large timescales (e.g. topology or load changes)
Classification of Routing Algorithms
Routing algorithms solve a routing problem under idealized assumptions. Routing protocols embed a routing algorithm in a real network environment; they operate in a distributed setting and must work around failures.
Classifications
Static/dynamic
Static algorithms change very slowly over time, mostly due to human intervention.
Dynamic algorithms re-compute routes in response to topology or traffic changes. The computation may occur periodically or in response to changes.
Global/Decentralized
Global algorithms assume each node has global knowledge/state information on the network.
Decentralized algorithms assume no node has complete state information about the network and exchange information with their neighbors.
Load-sensitive/-insensitive
Load-sensitive algorithms assume link costs vary dynamically depending on the level of congestion in the link.
Internet routing protocols such as RIP, OSPF and BGP are load-insensitive and assign a static cost to each link.
08. Introduction to Routing - Link-State Routing
Modelling
Routers are modelled as nodes and links are modelled as edges. Edges have associated metrics:
- Cost of the link: delay, monetary costs, distance etc.
- Available resources: current capacity etc.
Dijkstra calculates the least-cost path.
Assumptions
Each router has a complete network picture: topology, link costs etc.
To do this, each router reliably floods information about its neighbors to every other router. All routers should have consistent information.
Flooding can sometimes cause broadcast storms; controlled flooding or spanning-tree broadcasts are used to prevent this.
Dijkstra’s Algorithm
After flooding, each router independently calculates the least-cost path from itself to every other router using Dijkstra’s algorithm and generates a forwarding table for every destination.
Definitions:
- u: the source node
- N': the set of nodes whose least-cost path is already known; initially N' = {u}. One node is added in each iteration
- c(x, y): cost of the edge between x and y
- D(v): current minimum cost of the path from u to node v. Initially:
  - D(v) = c(u, v) if v is adjacent to u
  - D(v) = infinity otherwise
  - D(u) = 0
- p(v): parent/predecessor node of v - the vertex before v along the shortest path from u to v

Until all nodes are in N':
- Find the node w not in N' with the smallest D(w)
- Add w to N'
- For all v adjacent to w and not in N':
  - D(v) = min(D(v), D(w) + c(w, v))
    - i.e. see if the cost D(w) plus the cost of the edge between w and v is less than the current cost D(v)
  - If D(v) changed, set p(v) = w
Running this algorithm, a forwarding table can be generated, mapping each destination to the next hop (the first node after u on the shortest path to that destination).
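A compact sketch of the computation using a min-heap (the example graph and its costs are made up):

```python
import heapq

def dijkstra(graph, source):
    """graph: {node: {neighbour: cost}}. Returns (D, p): least costs and predecessors."""
    D = {source: 0}
    p = {}
    heap = [(0, source)]
    visited = set()                      # N': nodes whose least-cost path is known
    while heap:
        d, w = heapq.heappop(heap)
        if w in visited:
            continue
        visited.add(w)
        for v, cost in graph[w].items():
            if v not in visited and d + cost < D.get(v, float("inf")):
                D[v] = d + cost          # D(v) = min(D(v), D(w) + c(w, v))
                p[v] = w                 # predecessor along the shortest path
                heapq.heappush(heap, (D[v], v))
    return D, p

graph = {"u": {"v": 2, "w": 5}, "v": {"u": 2, "w": 1}, "w": {"u": 5, "v": 1}}
print(dijkstra(graph, "u"))   # D: u=0, v=2, w=3; p: v=u, w=v
```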
Complexity:
- If a min-heap is used, getting the least-cost node w not in N' is O(log n)
- For each iteration, D(v) needs to be updated for all of w's neighbors; across the whole run each edge is examined once, giving O((n + E) log n) overall
Oscillation with Link-State Routing
Since routers have a complete network picture, congestion-based link costs can cause oscillation: every router may simultaneously find a cheaper path through the same lightly loaded node and shift its traffic there, making that node congested; on the next computation the traffic all shifts away again, and the routes keep oscillating.
09. Introduction to Routing - Distance Vector Algorithm
A dynamic, decentralized and asynchronous routing algorithm.
Both RIP and BGP use the Bellman-Ford (distance-vector) algorithm.
Bellman-Ford Equation
Let d_x(y) be the cost of the least-cost path from node x to node y. Then d_x(y) = min over all neighbors v of x of { c(x, v) + d_v(y) }, where c(x, v) is the cost of the link from x to v; d_x(y) = 0 if x = y.
Distance-Vector Algorithm
The distance vector D_x = [D_x(y) : y in N] holds node x's estimate of its cost to every destination y in the network.
Each node x knows the cost c(x, v) to each of its neighbors v, maintains its own distance vector, and stores the most recently received distance vector of each neighbor.
In the algorithm, each node periodically sends its own distance vector estimates to its neighbors. When it receives a new estimate, it can update its own distance vector and, if it changes, update its own neighbors. The distance estimates D_x(y) will eventually converge to the actual least cost d_x(y).
The algorithm is iterative and asynchronous: each local iteration is called by either a local link cost change or a DV update message from its neighbor.
The algorithm is distributed: each node notifies its neighbors only when its DV changes, and each node only knows the next hop, not the entire path.
Hence, each node is in one of three states:
- Wait for change in local link cost, or message from a neighbor
- Recompute DV estimate
- Notify neighbors if required
Pseudo-code
distance_vector = {}            # own cost estimate to every destination y
neighbor_distance_vectors = {}  # most recently received DV from each neighbor

def init():
    for y in destinations:
        distance_vector[y] = cost(self, y)  # INFINITY if y is not a neighbor
    for w in neighbors:
        neighbor_distance_vectors[w] = {y: INFINITY for y in destinations}
        send(w, distance_vector)

while True:
    # Block until a link cost changes or a neighbor sends a new DV
    # (this updates cost()/neighbor_distance_vectors as a side effect)
    get_update_or_link_cost_change()
    change_occurred = False
    for y in destinations:
        new_cost = min(cost(self, v) + neighbor_distance_vectors[v][y] for v in neighbors)
        if new_cost != distance_vector[y]:
            distance_vector[y] = new_cost
            change_occurred = True
    if change_occurred:
        for w in neighbors:
            send(w, distance_vector)
Count-to-infinity problem
Good news travels fast: if a link cost is lowered, this change propagates through the network quickly.
However, there is an issue when a link cost increases. For example, consider three nodes x, y and z with link costs c(x, y) = 1, c(y, z) = 2 and c(x, z) = 20.
Initial state (each row is that node's distance vector; bracketed entries show the next hop where the path is not the direct link):
| | x | y | z |
|---|---|---|---|
| x | 0 | 1 | 3 (y) |
| y | 1 | 0 | 2 |
| z | 3 (y) | 2 | 0 |
At t1, the cost of link (x, y) increases to 40. y detects this and recomputes D_y(x) = min(c(y, x), c(y, z) + D_z(x)) = min(40, 2 + 3) = 5, now routing to x via z:
| | x | y | z |
|---|---|---|---|
| x | 0 | 1 | 3 (y) |
| y | 5 (z) | 0 | 2 |
| z | 3 (y) | 2 | 0 |
At t2, z receives y's updated vector and recomputes D_z(x) = min(c(z, x), c(z, y) + D_y(x)) = min(20, 2 + 5) = 7, still routing via y:
| | x | y | z |
|---|---|---|---|
| x | 0 | 1 | 3 (y) |
| y | 5 (z) | 0 | 2 |
| z | 7 (y) | 2 | 0 |
Thus, at each iteration, y and z take turns increasing their estimates by 2, each still routing to x through the other. This continues until z's cost via y exceeds its direct cost of 20, at which point the estimates finally converge:
| | x | y | z |
|---|---|---|---|
| x | 0 | 1 | 3 (y) |
| y | 22 (z) | 0 | 2 |
| z | 20 | 2 | 0 |
Poisoned reverse
If a node z routes through its neighbor y to reach a destination x, then z advertises to y that its distance to x is infinite (it keeps telling the truth to its other neighbors).
Before the link cost change in the example above, x and z each route to the other via y, so each advertises an infinite distance for that destination when talking to y:
| | x | y | z |
|---|---|---|---|
| x | 0 | 1 | 3 (y) |
| y | 1 | 0 | 2 |
| z | 3 (y) | 2 | 0 |
At t1, the cost of link (x, y) increases to 40. Because z has been advertising D_z(x) = infinity to y, y recomputes D_y(x) = min(40, 2 + infinity) = 40 and routes to x directly:
| | x | y | z |
|---|---|---|---|
| x | 0 | 1 | 3 (y) |
| y | 40 | 0 | 2 |
| z | 3 (y) | 2 | 0 |
At t2, z receives y's update and recomputes D_z(x) = min(20, 2 + 40) = 20, routing directly (so it stops poisoning its advertisement to y):
| | x | y | z |
|---|---|---|---|
| x | 0 | 1 | 3 (y) |
| y | 40 | 0 | 2 |
| z | 20 | 2 | 0 |
At t3, y receives z's update and recomputes D_y(x) = min(40, 2 + 20) = 22, routing via z:
| | x | y | z |
|---|---|---|---|
| x | 0 | 1 | 3 (y) |
| y | 22 (z) | 0 | 2 |
| z | 20 | 2 | 0 |
Hence, the lie makes it converge faster.
However, poisoned reverse can make the situation worse and cause routing loops. Example:
Consider only entries only to
So
At
Hence,
Now
Hence,
Now,
Hence,
Hence,
The computed path from
To fix this, RIP limits the maximum cost of a path to 15.
Comparison of Link-State and Distance-Vector Algorithms
Message complexity:
- LS: with n nodes and E links, O(nE) messages are sent
- DV: only exchanging information between neighbors
  - Convergence time varies
Speed of convergence:
- LS: O(n^2) algorithm, requiring O(nE) messages
  - May have oscillations
- DV: convergence time varies
  - Routing loops
  - Count-to-infinity problem
Robustness - behavior when router malfunctions:
- LS: node advertises incorrect link cost
- Each node computes only its own table
- DV: node advertises incorrect path cost
- Each node’s table used by others; errors propagate through the network
10. NAT, IPv6, RIP, OSPF, BGP
NAT
Motivations:
- ISP only needs to allocate one IP address per customer
- Can change addresses of devices in the local network without the outside world needing to know
- Can change ISP without changing local addresses
- Devices in local network not explicitly addressable by outside world
Each device has an internal IP address, but packets exiting the network have their source address transformed into a single public IP address.
Question: how does the router know which host to forward returning traffic if all devices have the same public IP address?
The router keeps a table mapping an (external IP, external port) tuple to an (internal IP, internal port) tuple. When a device sends a packet, the router chooses an unused external port for the connection, and when the response packet arrives, it maps the destination port back to the device's internal IP and port.
The port number is a 16 bit field, so it theoretically allows over 60,000 simultaneous connections.
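A toy sketch of that translation table (the public IP, port choice and helper names are hypothetical):

```python
import random

PUBLIC_IP = "203.0.113.5"     # hypothetical public address of the NAT router
nat_table = {}                # (public_ip, public_port) -> (private_ip, private_port)

def outbound(src_ip, src_port):
    """Rewrite the source of an outgoing packet and remember the mapping."""
    public_port = random.randint(1024, 65535)        # pick an unused external port
    while (PUBLIC_IP, public_port) in nat_table:
        public_port = random.randint(1024, 65535)
    nat_table[(PUBLIC_IP, public_port)] = (src_ip, src_port)
    return PUBLIC_IP, public_port

def inbound(dst_ip, dst_port):
    """Map a returning packet back to the internal host, if a mapping exists."""
    return nat_table.get((dst_ip, dst_port))          # None -> drop the packet

pub = outbound("192.168.1.23", 50512)
print(inbound(*pub))          # ('192.168.1.23', 50512)
```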
Controversy:
- Port numbers address processes, not hosts
- Port numbers are layer 4; routers should only process up to layer 3
- Violates the end-to-end argument; hosts no longer talk directly to each other
- IPv6 is a better solution
IPv6
Motivations:
- IPv4 address space getting full
- Header format helps speed up processing/forwarding
- Facilitates QoS
Datagram Format
IPv6 is simpler; no fragmentation, no checksums, fixed length (40 bytes) header:
- Version (4), value 6
- Traffic class (8): facilitates QoS
- Flow label (20): packet identifier
- Payload length (16): length of payload in bytes
- Next header (8): type of next (upper-layer) header (e.g. TCP, UDP)
- Hop limit (8): TTL
- Source address (128)
- Destination address (128)
- Payload
Transition
Old devices only support IPv4, and some routers cannot be upgraded. Hence, ‘flag days’, where the world switches protocol on some specified day, will not work.
With a dual-stack approach, all IPv6 nodes also have complete IPv4 implementations, allowing packets to be translated in both directions as they are routed. This loses IPv6- or IPv4-specific fields.
Another approach is tunnelling: stuff the whole IPv6 datagram into the payload of an IPv4 datagram. If the packet is small enough and DF is set, an IPv6 router at the far end of the tunnel can then parse the payload and forward it as an IPv6 packet.
Due to DHCP, CIDRised addresses and NAT partially solving the IP address shortage problem (in the short term), IPv6 adoption has been slow.
Routing in the Internet
Flat networks don’t scale: if you have millions of hosts, storing routing information for each requires large amounts of memory. If link-state routing is used, broadcasting LS updates takes too much bandwidth, and if distance-vector routing is used, it will never converge.
Hierarchical Routing
Hierarchical routing aggregates routers into autonomous regions, where routers belong to the same organization and run the same intra-AS routing protocol.
Gateway routers have a direct link to a router in another AS.
The forwarding table is configured by both intra- and inter-AS routing algorithms:
- The intra-AS algorithm sets entries for internal destinations
- The inter-AS algorithm (together with the intra-AS algorithm) sets entries for external destinations
Routing between ASes
Case 1: a single gateway router to only one other AS:
- Just forward all external traffic to that gateway router
Case 2: two or more physical links to other ASes (typically a transit AS):
- Learn from the inter-AS protocol that subnet X is reachable via multiple gateways
- Use routing information from intra-AS protocol to determine costs of path to each of the gateways
- Hot potato routing; choose the closest gateway
- Determine interface I that connects to first router on the path to the nearest gateway; insert
(X, I)into the forwarding table
RIP - Routing Information Protocol
Uses the distance vector algorithm.
Distance metric:
- Number of hops, max 15
- Number of subnets traversed along the shortest path from the source router to the destination subnet, including the destination subnet (hence it is always at least 1)
To determine subnets, detach each interface from its host/router to create islands of isolated networks; these isolated networks, whose endpoints are the interfaces, are called subnets.
RIP advertisements:
- Distance vectors exchanged among neighbors every 30 seconds via Response Message
- They are also called advertisements, and are sent in UDP packets (layer-4)
- An application-level process (daemon) called `route-d` manages the RIP routing tables
  - UDP is layer 4, so it cannot be managed at the router (network) level
- Each advertisement contains a list of up to 25 destination nets within the AS
Example:
w -- A -- x -- D -- y -- B -- ... -- z
|
C
The routing table in D would have:
| Dest. net | Next router | Num. hops |
|---|---|---|
| w | A | 2 |
| y | B | 2 |
| z | B | 7 |
| x | NA | 1 |
If A then advertises its own routing table where the cost to z (through C) is 4, D would use the Bellman-Ford equation and update its routing table to route requests to z through A with a cost of 5.
Link Failure and Recovery
After 180 seconds of no advertisements, the neighbor/link is declared dead; routes going through that neighbor are invalidated and new advertisements are sent to neighbors. Through this, the link failure information propagates through the entire network.
Differences between RIP and the DV Algorithm
- Can advertise a maximum of 25 subnets
- The link cost is always 1 in RIP
- The ‘nodes’ of the DV algorithm correspond to subnets
- Instead of distance vectors, it sends the routing table
- Routers do not store their neighbors’ DVs
- Routers exchange advertisements at fixed intervals, not when there are updates
- Does not have the count-to-infinity problem - has the count-to-16 problem
OSPF - Open Shortest Path First
- Uses the link state algorithm
- Link-state advertisements are disseminated to the entire AS via broadcast messages (flooding)
- OSPF messages sent directly over IP (layer 3), not TCP/UDP
- Typically deployed in upper-tier ISPs
Advanced features not in RIP:
- Security: all OSPF messages are authenticated
- Multiple same-cost paths allowed
- Link cost can be set by admins (traffic engineering)
- Unicast and multicast support (one-to-one and one-to-group)
- Two-level hierarchy
OSPF ASes are configured into areas. Within an area:
- Each area runs its own OSPF algorithm
- Link-state advertisements do not exit the area
- Each node has detailed area topology
- Area border routers summarize distance information for the nets in their own area and advertise it to other area border routers
There is one backbone area which links areas together. In the backbone:
- Routers that run OSPF routing within the backbone area are called backbone routers
- Boundary routers connect to other AS’s, typically through BGP
BGP - Border Gateway Protocol
The de facto standard. Provides each AS with the ability to:
- Obtain subnet reachability information
- Propagate the reachability information
- Determine ‘good’ routes to subnets
Pairs of routers (BGP peers) exchange routing information over TCP connections (called BGP sessions).
eBGP sessions allow routers in two different AS’s to exchange reachability information - when it advertises a prefix, it promises that it can forward any datagrams destined to that prefix. The receiver can then create an entry for that prefix in its forwarding table and re-advertise the same information to others via iBGP (between routers in the same AS) or eBGP sessions.
The advertisement is a combination of the prefix and attributes. Two important attributes are:
- `AS_PATH`: not updated in iBGP connections. Contains a list of the ASes through which the advertisement has passed - reject the advertisement if this AS is already in the path
- `NEXT_HOP`: not updated in iBGP connections. Advertisements flow from the destination outwards, so it contains the address of the boundary router for the next-hop AS
- An import policy is used to accept or decline adverts
Route selection
Routers may learn about more than one route to a given prefix, so it uses elimination rules to pick one:
- Local preference value attribute (policy decision)
- Shortest AS-PATH (DV algorithm)
- Closest NEXT-HOP router (hot potato routing)
- Additional criteria
Routing policy can affect routing. For example, a keep-silent policy could occur if X is a dual-homed (e.g. for redundancy) customer network connected to provider networks B and C; it will not want to route traffic between B and C, and hence will not advertise to B a route to C, or vice versa.
11. Reliable Data Transfer - Error Detection and Correction
Transmission errors can occur due to factors such as:
- Thermal noise
- Weak signal strength, interference, jamming, jitter etc.
- Router/switch/station crashing
- Packets being:
- routed in the wrong direction
- dropped due to congestion
- dropped due to insufficient resources
To correct:
- Bit errors due to signal attenuation/noise, use error detection and correction
- Buffer overflows due to speed mismatch, too much traffic etc., use flow/congestion control
- Lost packets due to buffer overflows, lack of resources, use acknowledgement and retransmission
- Out-of-order packets due to retransmissions/packets using different routes, use acknowledgement and retransmission
Error detection
Use redundancy bits, sometimes called error detection and correction bits.
These bits can be used to check if an error has been detected (not that it has occurred).
Parity Check
This is the simplest way of detecting an error; append a single parity bit to the end of a block of data.
- Even parity: number of 1’s in the data, plus parity bit is even
- Odd parity: number of 1’s in the data, plus parity bit is odd
2-D Parity Check
Divide bits into a matrix and use parity bits for each column and row, plus a parity bit for the parity bits (number of 1’s for the row and column parity bits).
Hence, it is possible to determine the bit which was flipped and hence correct single-bit errors.
This is called forward error correction - the receiver can detect and correct errors without needing to transmit a message to the sender.
Internet Checksum
Used for the IPv4 header checksum and the UDP checksum (optional in IPv4, mandatory in IPv6). The UDP checksum covers part of the IP header (source/destination addresses plus other fields), the UDP header and the data.
Segment contents are treated as a sequence of 16-bit integers.
The checksum is the wraparound sum of the integers (ones complement):
- Add the two integers
- If there is overflow, add 1 to the result (wraparound)
Once all integers have been summed, invert the bits.
The receiver will sum all segments, plus the checksum, and check that all the bits are 1.
This scheme is considerably less effective than CRC, but also very simple to implement.
Why does IP (not v6), UDP and TCP even need checksums?
- Some link-layers do not do error correction
- Bit errors can occur anywhere - e.g. router memory
- As a final end-to-end check
- Certain functionality must be implemented on an end-to-end basis
def ones_complement_addition(shorts):
    # Wraparound (ones'-complement) sum of a sequence of 16-bit integers
    result = 0
    for short in shorts:
        result += short
        if result > 0xFFFF:
            result -= 0xFFFF  # drop the carry out of bit 16 and add it back in
    return result

def checksum(shorts):
    # Invert the bits of the wraparound sum
    return 0xFFFF - ones_complement_addition(shorts)

def check_checksum(shorts, received_checksum):
    # Receiver check: data plus checksum should sum to all 1s (0xFFFF)
    return 0xFFFF == ones_complement_addition(shorts) + received_checksum
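For example, applying the functions above to two hypothetical 16-bit words:

```python
words = [0xF234, 0xF345]                   # two example 16-bit integers
total = ones_complement_addition(words)    # 0xF234 + 0xF345 = 0x1E579 -> wraps to 0xE57A
chk = checksum(words)                      # 0xFFFF - 0xE57A = 0x1A85
print(hex(total), hex(chk))
print(check_checksum(words, chk))          # sum + checksum == 0xFFFF -> True
```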
Cyclic Redundancy Check (CRC)
Assumptions:
- d data bits
- The sender and receiver agree on an r+1 pattern called a generator, where the left-most bit is 1
The sender chooses r additional bits to append to the data such that the d+r bits are divisible by the generator using modulo-2 arithmetic.
The receiver then divides the d+r bits by the generator; if there is no remainder, no error is detected.
- D: d-bit data block
- F: r additional bits (frame check sequence)
- T: (d+r)-bit frame to be transmitted
- G: predetermined pattern of r+1 bits (the generator)
Modulo-2 arithmetic:
- Addition/subtraction: exclusive OR (XOR)
- Multiplication/division: same as base-2 arithmetic, but no carries or borrows
The frame is formed by shifting D left by r bits and placing F in the low-order bits: T = D * 2^r + F (where the addition is XOR). F is chosen as the remainder of D * 2^r divided by G; adding this remainder back (mod 2) cancels it out.
This makes T divisible by the divisor G. Hence, the receiver detects no error exactly when the received frame leaves no remainder when divided by G.
Example
G=11 0101, D=10 1000 1101, r=5
Initially, set the F bits to zero:
Divide T = 101 0001 1010 0000 by G = 11 0101:
10 1000 XOR 11 0101 = 01 1101. The divisor and the dividend window have an equal number of bits, so the most significant bit of the quotient is 1.
Get rid of the leading 0, then pull down the next bit of T:
11 1011 XOR 11 0101 = 00 1110. The second MSB of the quotient is 1 again.
Get rid of the leading 0’s, then pull down the next bit of T:
1 1101 XOR 11 0101; this will not fit, so add a 0 as the next MSB of the quotient. Pull down the next bit of T:
11 1010 XOR 11 0101 = 00 1111
…
So the result of the division is 11 0101 0110 with a remainder of 1110. Thus, F=01110 and the transmitted frame is T=101 0001 1010 1110.
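A short mod-2 division sketch that reproduces the example above:

```python
def crc_remainder(data_bits, generator_bits):
    """Append len(generator)-1 zero bits to the data and return the mod-2 remainder."""
    r = len(generator_bits) - 1
    bits = [int(b) for b in data_bits] + [0] * r
    gen = [int(b) for b in generator_bits]
    for i in range(len(data_bits)):            # long division, MSB first
        if bits[i] == 1:                       # only XOR when the leading bit is 1
            for j in range(len(gen)):
                bits[i + j] ^= gen[j]
    return "".join(str(b) for b in bits[-r:])  # the last r bits are the remainder F

print(crc_remainder("1010001101", "110101"))   # '01110', matching the example
```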
Other notes
- It is called a cyclic redundancy check because a circular shift of a codeword is still divisible by `G`
- CRC can detect consecutive (burst) bit errors of `r` bits or fewer
- CRCs are also called polynomial codes: the bit string is treated as a polynomial in X with 0/1 coefficients, and the division is carried out on these polynomials using mod-2 arithmetic
- Four versions of `G` are widely used, e.g. CRC-32 is the sum of X to the powers 32, 26, 23, 22, 16, 12, 11, 10, 8, 7, 5, 4, 2, 1, 0
Forward Error Correction
Useful on wireless applications with propagation delay and high bit error rates.
Messages are divided into blocks of k-bits called datawords; the FEC encoder generates n-bit codewords.
Hamming distance: the number of bit positions in which two codewords of equal length differ.
Error correction can be done by finding the minimum distance from the received codeword to any valid codeword (and thus dataword). This fails when it is an equal distance away from multiple valid codewords.
Useful properties for the scheme:
- Maximize the overall hamming distances between any two valid codewords
- Ensure that no codeword is an equal distance away from two codewords
k-error correcting: minimum hamming distance between two valid codewords is at least 2k + 1.
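A minimal sketch of minimum-distance decoding over a made-up 2-bit-to-5-bit codebook (its minimum Hamming distance is 3, so it is 1-error correcting):

```python
def hamming(a, b):
    """Number of bit positions in which two equal-length codewords differ."""
    return sum(x != y for x, y in zip(a, b))

# Hypothetical FEC code: 2-bit datawords mapped to 5-bit codewords
codebook = {"00": "00000", "01": "01011", "10": "10101", "11": "11110"}

def decode(received):
    # Pick the valid codeword at minimum Hamming distance from the received word
    best = min(codebook.values(), key=lambda c: hamming(c, received))
    return [d for d, c in codebook.items() if c == best][0]

print(hamming("00000", "01011"))  # 3
print(decode("01010"))            # '01' - one bit flipped, still corrected
```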
12. Reliable Data Transfer - ARQ Protocols
Lost packets and out of order packets: acknowledgement and retransmission required.
Reliable data transfer protocols based on retransmission are known as Automatic Repeat reQuest (ARQ) Protocols. These include:
- Alternating Bit Protocol
- Go-Back-N
- Selective Repeat (Selective Reject)
Reliable Data Transfer (RDT) Protocols
RDT 1.0: Perfectly-Reliable
A perfectly reliable channel with:
- No bit errors
- No buffer overflows
- No out-of-order packets
- No packet loss
RDT 2.0: Bit Errors
All packets will arrive at the receivers, but the packet may be corrupted. Checksumming/CRC required for error detection.
Once the sender transmits a packet, it waits for a NAK/ACK from the receiver:
- If NAK, retransmit the packet
- If ACK, send the next packet once it is received from the higher layer
This is known as a stop-and-wait protocol as the sender must wait for the NAK/ACK.
RDT 2.1: Handles Corrupted ACK/NACK
In RDT2.0, if the response is corrupted, the sender does not know if the receiver got a corrupted packet.
In this case, the sender will retransmit the packet. As the receiver would not know whether this is a new transmission or a retransmission, RDT 2.1 adds sequence numbers.
As there is a response after every single packet, only a single bit is required for the sequence number. If it is a retransmission, it will use the same value but if it is a transmission, it will flip the bit.
RDT 2.2: NAK-Free Protocol
The receiver only sends ACK messages, but it also sends a sequence number along with it.
The ACK sequence number is the sequence number of the last packet that was received successfully.
RDT 3.0: Lossy Channel with Bit Errors
Packets can get lost and bits can get flipped.
If ACK does not arrive after a timeout, the sender should retransmit the packet. Hence, the sender retransmits on both packet/ACK loss or delay.
If the timeout is too long, performance decreases, but if it is too short, duplicate data packets get sent.
This is also known as the alternating-bit protocol.
There are four cases:
- No loss
- Lost packet
- Lost ACK
- Premature timeout (ACK received after timeout)
Pipelined Protocols
RDT 3.0 is a stop-and-wait-protocol, so the sender spends most of its time waiting rather than transmitting. If RTT is the round trip time, L is the packet size and R is the transmission rate, then the sender utilization is U_sender = (L/R) / (RTT + L/R) (add L/R on denominator as the entire packet must be received before the ACK can be sent).
Pipelining allows the sender to send multiple packets without waiting for an ACK to arrive. This requires the sequence number to occupy more than one bit, and forces the sender and receiver to buffer more than one packet. There are two basic approaches:
Go-Back-N Protocol
The sender can send multiple packets but can have no more than N unacknowledged packets in the pipeline.
These protocols are sometimes called sliding-window protocols: there is a window of size N that must contain all the unacknowledged packets.
Packets to the left of the window are acknowledged and packets to the right of it are not yet sent. Inside the window are unacknowledged packets and packets that can be sent immediately.
The window size is related to flow and congestion control.
The protocol uses cumulative acknowledgement: an ACK with sequence number n acknowledges all packets with sequence numbers <= n. If any unexpected packets arrive (e.g. out-of-order), they are discarded and an ACK with the most recently received valid packet is sent.
On timeout, all packets in the window are resent.
Performance problem: if the transmission rate is high, there will be a lot of unacknowledged packets, so if the first packet is corrupted, all the following packets will be discarded.
Sender events:
- The sender has data to send or it receives an ACK:
- If there are less than N unacknowledged packets, send the next packet (and slide the window right)
- It reaches a timeout:
- Resend all packets in the window
Receiver events:
- An invalid or unexpected (e.g. sequence number already received) packet arrives:
- Send an ACK for the most recently received valid packet
- A good packet arrives:
- Move the window right and send an ACK for that packet
In the quiz:
- Base: first sequence number in the window
- Next sequence number: sequence number for next packet to send
- Usable sequence numbers: sequence numbers for frames that can be sent immediately
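A sketch of the sender-side bookkeeping in those terms (transmission, ACK arrival and timers are abstracted into plain function calls):

```python
N = 4                      # window size
base = 0                   # first unacknowledged sequence number
next_seq_num = 0           # sequence number of the next packet to send

def can_send():
    return next_seq_num < base + N           # fewer than N unacknowledged packets

def send_next():
    global next_seq_num
    if can_send():
        print(f"send packet {next_seq_num}")
        next_seq_num += 1

def on_ack(n):
    global base
    base = max(base, n + 1)                  # cumulative ACK: everything <= n is acknowledged

def on_timeout():
    for seq in range(base, next_seq_num):    # resend the whole window
        print(f"resend packet {seq}")

for _ in range(6):      # tries to send 6 packets; only 4 fit in the window
    send_next()
on_ack(1)               # packets 0 and 1 acknowledged; window slides right
send_next()             # now packet 4 can be sent
```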
Selective Repeat/Reject
The receiver individually acknowledges all correctly received packets, buffering packets as needed for in-order delivery to the higher layer.
The sender only resends the oldest packet for which the ACK is not received (timeout).
The receiver’s window can have:
- Expected packets, not yet received
- Buffered packets that arrived out of order
- Usable sequence numbers, not yet received
If a packet is re-received, this means the ACK for the packet was lost, and so the receiver should send an ACK for that packet again. If the packet was near the base of the window before the receiver sent the original ACK, this could mean that the receiver window is now ahead of the sender window.
The worst case scenario for window size is:
- All but the first packet in the window arrive and their ACKs reach the sender; the receiver buffers them
- The first packet is resent after a timeout and arrives, but its ACK is lost
- The receiver delivers the whole window and its window moves forward by N; the sender's window does not move
- Sender window is [1, N]
- Receiver window is [N + 1, 2N]
- The sender times out and sends the first packet again
To prevent this retransmission from being interpreted as a new packet, the sender's base sequence number must not fall inside the receiver window. Hence, the sequence number space must be at least twice the window size.
Hidden assumption: the packets will NOT be reordered within the channel between the sender and receiver - this is reasonable for a single link, but could be broken in a multi-hop network.
If this assumption is broken, old copies of a packet with a sequence number not in either the sender's or receiver's window can arrive. One solution is to give every packet a maximum lifetime so that it cannot stay in the network for very long - ideally, the sequence space is large enough that it is impossible to cycle through all the sequence numbers within this time. Note that the receiver will ignore these packets regardless.
13. Transport Layer Protocol - Reliable Data Transfer
TCP Segments
TCP will divide each message into segments - the sequence number for each segment is the index of the first data byte in the segment. In practice, a random sequence number is used for the initial segment’s sequence number.
The receiver’s ACK message contains the sequence number of the next byte it expects: the received segment’s sequence number plus the number of data bytes.
TCP uses a single retransmission timer.
Simplified TCP sender (without refinements below):
next_sequence_number = initial_sequence_number
send_base = initial_sequence_number
timer = None  # single retransmission timer

while True:
    if data_received_from_upper_layer:
        segment = create_tcp_segment(next_sequence_number)
        if timer is None:  # no unacknowledged segments in the window yet
            timer = start_timer()
        send_packet(segment)
        next_sequence_number += len(segment.data)
    elif timer_expired:
        # re-send only the segment with the oldest (smallest) sequence number
        segment = unacknowledged_packets_in_window()[0]
        send_packet(segment)
        timer = start_timer()
    elif ack_received:
        ack = get_sequence_number()
        if ack > send_base:
            send_base = ack  # cumulative ACK: all bytes before ack have arrived
            if len(unacknowledged_packets_in_window()) != 0:
                timer = start_timer()  # restart for the oldest remaining segment
            else:
                timer = None  # nothing outstanding: stop the timer
Some refinements:
- When the timer expires, double the timeout interval: `timeout_interval *= 2`
  - The expiration may be an indication of congestion
  - Typical initial value is one second
- When the timer is restarted, reset the timeout interval to its initial value
- Timer restarts occur when an ACK arrives or data is received from the upper layer AND there are currently no unacknowledged segments
- Fast retransmit: If the sender receives three ACKs with the same segment number, retransmit it before the timeout
- ACKs coming back may also be a signal that the network is not too congested
Events on the receiver side:
- In order segment with expected sequence number arrives:
- Wait 500 ms; if no segment comes within this time, send an ACK
- Above, but there is already another segment with pending ACK:
- Immediately send an ACK, acknowledging the arrival of both packets
- Out-of-order segment with higher-than-expected sequence number:
- Immediately send a duplicate ACK with the sequence number of the next expected byte
- Arrival of segment that partially/completely fills a gap:
- If the segment fills in the start of the gap, immediately send an ACK
TCP is a mixture of the go-back-N and selective repeat strategies:
- It uses cumulative ACK, similar to GBN
- Out-of-order segments are often buffered, similar to SR
- When there is a timeout, the sender retransmits only the smallest unacknowledged packet, unlike GBN
Setting the Timeout Value
The timeout value should be longer than the round-trip time, so the first step is to find this value.
The SampleRTT is the time from segment transmission to ACK, with the condition that retransmitted segments are never sampled - the ACK for the first transmission could arrive after the sender retransmits, giving a wrong value.
The RTT varies over time, so several samples are required. They are combined using an exponentially weighted moving average, which puts more weight on recent samples:
EstimatedRTT = (1 - alpha) * EstimatedRTT + alpha * SampleRTT, with alpha typically 0.125
An estimate of the variability is also required:
DevRTT = (1 - beta) * DevRTT + beta * |SampleRTT - EstimatedRTT|, with beta typically 0.25
With these two values, the timeout interval can be calculated:
TimeoutInterval = EstimatedRTT + 4 * DevRTT
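A sketch of how these estimates might be maintained in code (standard coefficients; the first-sample initialization follows common practice):

```python
ALPHA, BETA = 0.125, 0.25

estimated_rtt = None
dev_rtt = 0.0

def on_rtt_sample(sample_rtt):
    """Update the EWMA estimate and variability, then return the timeout interval."""
    global estimated_rtt, dev_rtt
    if estimated_rtt is None:                 # first sample initialises the estimate
        estimated_rtt, dev_rtt = sample_rtt, sample_rtt / 2
    else:
        dev_rtt = (1 - BETA) * dev_rtt + BETA * abs(sample_rtt - estimated_rtt)
        estimated_rtt = (1 - ALPHA) * estimated_rtt + ALPHA * sample_rtt
    return estimated_rtt + 4 * dev_rtt        # TimeoutInterval

for s in (0.100, 0.120, 0.300):               # RTT samples in seconds
    print(round(on_rtt_sample(s), 4))
```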
14. Transport Layer Protocols - Flow and Congestion Control
TCP creates a reliable data transfer service on top of IP using pipelined segments, cumulative ACKs, retransmission timers etc.
Setting the window size
The window size is simply the difference between the last byte sent and last byte acknowledged. TCP simply limits this to LastByteSent - LastByteACKed <= min(CongestionWindow, RecvWindow), where the former is determined by the network congestion control algorithm and the latter by the TCP flow control algorithm.
Flow Control
Flow control is a speed-matching service - it matches the rate at which the sender writes against the rate at which the receiver can read.
The sender and receiver allocate buffers; the receiver simply sends the amount of spare room in its buffer (TCP is full-duplex):
RecvWindow = RecvBuffer - (LastByteReceived - LastByteRead)
This is written to the 16-bit ‘Receive Window’ section in the TCP header.
Extreme Case
If the receive window is 0 bytes (i.e. buffer full as the higher-level application has not read the buffer contents), the sender will stop sending data.
This will mean that the receiver can no longer send acknowledgement packets back with updated window sizes. Hence, when the RecvWindow is 0, the sender will periodically send 1 byte of data.
Congestion Control
Packet retransmission treats the symptom of packet loss, not the cause, which is typically overflowing router buffers. In fact, retransmissions can make the problem worse.
Congestion control is not so much a service provided to the application, but to the network as a whole.
There are two broad approaches to congestion control. The first is network-assisted congestion control, where routers provide feedback to end-systems by indicating an explicit rate the sender should send at and/or a single bit flag indicating congestion. Today’s systems try to make routers as simple as possible and so do not use this approach.
Instead, end-to-end congestion control is used, where congestion is inferred from end systems observing loss and delay.
Congestion is caused by too many sources sending too much data too fast for the network to handle. This manifests itself through lost packets (buffer overflows) and long delays (queueing).
Congestion control attempts to ensure everyone enjoys a share of resources without causing network congestion. However, there is no predefined share for everyone - it is dynamic.
Assuming that RecvWindow > CongestionWindow (i.e. LastByteSent - LastByteACKed <= CongestionWindow) and there is negligible transmission delay (the time between sending the first and last byte in a segment is negligible), within one RTT the sender sends at most CongestionWindow bytes.
Hence, the rate of transmission is approximately CongestionWindow / RTT - CongestionWindow can be adjusted according to network congestion.
The arrival of ACKs is taken as an indication of the network being congestion-free: if ACKs (for previously unacknowledged segments) arrive quickly, the congestion window should be increased quickly (and slowly if ACKs arrive slowly).
After a loss event - a timeout or 3 duplicate ACKs, the congestion window should decrease.
TCP Slow Start
Begin the connection with CongestionWindow = 1 MSS (maximum segment size). This means 1 segment is sent per RTT.
This may be much smaller than the available bandwidth. Hence, the rate should be increased exponentially until the first loss event: double the congestion window every RTT - this is done by incrementing CongestionWindow by 1 MSS for every ACK.
After the first loss event, congestion control moves on to AIMD:
TCP AIMD (Additive Increase, Multiplicative Decrease)
Additive increase (AI): increase CongestionWindow by 1 MSS every RTT in the absence of loss events. This can be done by increasing CongestionWindow by MSS * (MSS / CongestionWindow) bytes each time an ACK arrives.
Multiplicative decrease (MD): halve CongestionWindow after every loss event.
This leads to a saw-tooth pattern in the congestion window size.
Timeout Rules
After three duplicate ACKs arrive, the CongestionWindow should be halved and then the window should grow linearly (AIMD).
After a timeout event, CongestionWindow should be set to 1 MSS and grow exponentially (slow start) until it reaches a threshold - half of CongestionWindow from before the timeout.
Note that the duplicate ACK behavior occurs in TCP Reno; TCP Tahoe’s (Reno’s predecessor) behavior is the same as a timeout event.
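A rough sketch of the Reno behaviour described above, counting the window in MSS units (no actual transmission):

```python
MSS = 1

cwnd = 1 * MSS                # congestion window: starts in slow start
ssthresh = 64 * MSS           # slow-start threshold

def on_ack():
    global cwnd
    if cwnd < ssthresh:
        cwnd += MSS                    # slow start: +1 MSS per ACK (doubles every RTT)
    else:
        cwnd += MSS * MSS / cwnd       # additive increase: about +1 MSS per RTT

def on_triple_duplicate_ack():         # Reno behaviour
    global cwnd, ssthresh
    ssthresh = cwnd / 2
    cwnd = ssthresh                    # multiplicative decrease, then grow linearly

def on_timeout():
    global cwnd, ssthresh
    ssthresh = cwnd / 2
    cwnd = 1 * MSS                     # back to slow start up to the new threshold

for _ in range(10):
    on_ack()
print(cwnd)                            # 11: still in slow start, +1 MSS per ACK
```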
15. Transport-Layer Services - TCP and UDP
Transport protocols provide logical communication between application processes running on different hosts.
The sender side breaks the application messages into segments and passes it to the network layer, while the receiver reassembles segments into messages and passes it to the application layer.
Multiple transport layer protocols are available, the most common being TCP and UDP.
The internet transport layer extends IP’s host-to-host delivery to process-to-process delivery (hence requiring multiplexing and demultiplexing), with TCP specifically adding reliable delivery, flow control and congestion control.
Multiplexing and Demultiplexing
Demultiplexing: the host receives IP datagrams; each has source/destination IP addresses and carries a single transport-layer segment. Inside each segment is the source/destination port number.
Connectionless-demultiplexing
With connectionless-demultiplexing (e.g. UDP), a two-tuple of the destination IP and port number can be used to identify the socket. Packets with different source IP and/or port numbers will be directed to the same socket.
Connection-oriented-demultiplexing
TCP sockets are identified by a 4-tuple: all four of the source and destination IP addresses and port numbers are required to direct the segment to the appropriate socket.
Hence, the server host may support many simultaneous TCP sockets through a single port number.
UDP
A bare bones, best effort service: segments can be lost or delivered out of order.
UDP is often used for streaming multimedia applications as this use case is loss-tolerant and rate-sensitive. UDP is also used in RIP, DNS, SNMP.
Segment structure:
| 16 bits | 16 bits |
|---|---|
| Source port | Destination port |
| Length | Checksum |
| Data | |
The length is the length in bytes of the segment, including the header.
Advantages:
- No connection establishment - no delay
- Simple: no connection state at sender or receiver
- Small header - 8 bytes
- No congestion control - sender can send as fast as possible
The UDP socket for an incoming packet can be identified by its destination IP and port.
TCP
Properties:
- Point-to-point: one sender, one receiver
- Provides a reliable, in-order byte stream
- No application-layer message boundaries
- Pipelined: TCP congestion/flow control sets the window size
- Full-duplex data: bi-directional data flow
- Connection-oriented: handshaking to initialize sender/receiver state
- Flow-controlled: sender will not overwhelm the receiver
Packet structure:
| 16 bits | 16 bits |
|---|---|
| Source port | Destination port |
| Sequence Number | |
| Acknowledgement Number | |
| Flags | Receive window |
| Checksum | Urg. data pointer |
| Options (variable) | |
| Application Data | |
Notes:
- Sequence number: byte stream number of the first byte in the segment’s data
- Acknowledgement: sequence number of next byte expected from the other side (cumulative ACK and hence acknowledges that all bytes prior to that have arrived)
- Urgent data pointer field rarely used
- Options can be padded with zeros to ensure its size is a multiple of 4 bytes.
Flags:
- Header length (4 bits) - number of bytes in header divided by 4
- Reserved (3 bits)
- `NS`: ECN-nonce
- `CWR`: congestion window reduced
- `ECE`: ECN-echo; on a `SYN` it indicates the host is ECN capable, otherwise it marks a packet whose congestion-experienced flag was set
- `URG`: urgent pointer field significant
- `ACK`: acknowledgement field significant - all packets sent by the client should have this set, excluding the initial `SYN`
- `PSH`: push function: asks to push buffered data to the application
- `RST`: reset connection
- `SYN`: synchronize sequence numbers: only the first packet sent by each end should have this set
- `FIN`: last packet from the sender
TCP Connection Management
Opening: three-way handshake
NB: segments with SYN or FIN flags are treated as if they have a 1 byte payload.
Step 1:
- Client host sends a TCP `SYN` segment (`SYN = 1`) to the server
  - Randomly picks an initial sequence number: `client_isn`
  - No application data
The second step both acknowledges the segment and sends the server’s ISN:
- Server host receives the `SYN`, replies with a `SYNACK` segment
  - Allocates buffers
  - Randomly picks an initial sequence number: `server_isn`
  - `SYN = 1`, `ACK = 1`, `ACK_number = client_isn + 1`, `sequence_number = server_isn`
This third step is needed to confirm that the client has received the server’s ISN:
- Client host receives the `SYNACK`, replies with an `ACK`
  - Response may contain data
  - `SYN = 0`, `ACK = 1`, `ACK_number = server_isn + 1`, `sequence_number = client_isn + 1`
Termination
- Step 1: client host sends `FIN` to the server
- Step 2: server host receives `FIN`, responds with `ACK`
- Step 3: once the server successfully closes the connection (freeing buffers etc.), it sends `FIN`
- Step 4: client receives `FIN`, responds with `ACK`
Client waits for double the maximum segment lifespan to ensure the ACK arrives.
16. Introduction to the Web and HTTP
Network Applications
Network applications are programs that run on different end-systems and communicate with each other over the network (e.g. web browser and server).
Application protocols are a small part of a network application, and different network applications may share the same application protocol.
Services applications need:
- Reliable data transfer (except some loss-tolerant applications like multimedia streaming)
- Bandwidth
- Some are rate-sensitive such as audio calls
- Some are elastic e.g. email, FTP
- Latency
- Some interactive real-time apps such as audio calls, VR, multi-player games etc. have tight end-to-end delay constraints
TCP provides a reliable transport service, flow control and congestion control, but no guarantees for latency or transmission rate.
UDP provides none of these guarantees.
Application Structure
Client-server:
- Server:
- Always-on
- Permanent IP address
- May use server farms for scaling
- Clients:
- Communicate with the server
- May be intermittently connected
- May have dynamic IP addresses
- Do not communicate directly with each other
Peer-to-peer:
- No always-on server
- Arbitrary end systems directly communicate
- Peers intermittently connected and can change IP addresses
- Highly scalable
- Difficult to manage
Hybrid. Napster is an example:
- File transfer uses peer-to-peer
- File search is centralized; peers register content at central server and query it to locate content
The Web
- On-demand service, compared to TV or radio which were scheduled
Hypertext Transfer Protocol (HTTP)
A web page consists of a base HTML-file and several referenced objects:
- Can be an HTML file (`iframe`), image (`img`) etc.
- Each object is addressable by a URL
HTTP uses the client-server model:
- The browser requests, receives and displays web objects
- The server sends objects in response to requests
Uses TCP:
- Client initiates a TCP connection to the server on port 80
- Server accepts the TCP connection
- HTTP messages exchanged between the browser and server
- TCP connection closed
HTTP is stateless: the server can work without maintaining any information about past client requests.
HTTP Connections
Non-persistent HTTP: At most one object is sent over a TCP connection. Used by HTTP/1.0.
Persistent HTTP: multiple objects can be sent over a single TCP connection; used by HTTP/1.1 by default.
The client will initialize the TCP connection, send an HTTP request message (containing some URL); the server will receive this response and form a response message containing the requested object.
HTTP/1.0 will close the TCP connection after the object is delivered, while persistent HTTP will leave the connection open and use the same connection for subsequent HTTP requests.
Response Time Modelling (RTT): time to send a small packet to travel from the client to the server and back.
For HTTP/1.0:
- 1 RTT to initiate the TCP connection (first two parts)
- 1 RTT for the HTTP request (and third part of the handshake) and the response back
Hence, the total time is 2 RTTs, plus the transmission time for the file. After the file is received, the TCP connection must also be closed.
In addition to this, the OS must allocate resources for each TCP connection; browsers will often use parallel TCP connections to fetch referenced objects.
Persistent HTTP with pipelining is the default behavior in HTTP/1.1; the client sends requests as soon as it encounters a referenced file.
HTTP Messages
The HTTP request consists of the following ASCII-encoded text:
- Request line (the three values are separated by a space):
  - Method (`GET`, `POST`, `PUT`, `HEAD` etc.)
  - URL
  - Version (`HTTP/1.0`, `HTTP/1.1` etc.)
  - `\r\n` to end the request line
- Header lines:
  - `header_field_name: header_value\r\n` (space required after the colon)
  - Order of headers is not significant, but fields sending control data should be sent first
  - `\r\n` after all header lines to indicate the end of the header section
- Entity body
  - Used for `POST` and `PUT` requests
Some methods:
- `GET`: get object
- `HEAD`: get the headers that would be returned if it was a `GET` request
- `POST`: filling out a form
- `PUT` (1.1+): upload the file in the entity body to the path specified in the URL field
- `DELETE` (1.1+): delete the file specified in the URL field
Some headers:
- `User-agent`
- `Connection: close|keep-alive`: defaults to `keep-alive` for HTTP/1.1
- `Accept-language`: comma-separated list of languages and an optional quality value for each language
  - Language code can be a language or a language and region (e.g. `en`, `en-gb`)
  - Weight: estimate of the user's preference for that language, written `lang-code;q=weight`. Defaults to `q=1`; can be any value between `0` and `1` (larger is better)
The response message follows this format:
- Status line (space-separated):
  - Version
  - Status code (e.g. `200`, `400`)
  - Phrase (text description of the status code, e.g. `OK`)
  - `\r\n`
- Header lines:
  - `header_field_name: header_value\r\n` (space required after the colon)
  - `\r\n` after all header lines to indicate the end of the header section
- Entity body
Some HTTP status codes:
- `1xx`: request received, continuing process
- `2xx`: request successful
  - `200`: OK
- `3xx`: redirection
  - `301`/`308`: permanent redirect. The latter enforces that the request method (e.g. `GET`, `POST`) cannot change
  - `302`/`307`: temporary redirect. The latter enforces that the request method (e.g. `GET`, `POST`) cannot change
  - `304`: not modified since the `If-Modified-Since` value
- `4xx`: client error
  - `400`: bad request
  - `404`: not found
- `5xx`: server error
  - `500`: internal server error
  - `502`: bad gateway (proxy/gateway received an invalid response from the upstream server)
  - `503`: service unavailable (e.g. overloaded)
  - `505`: HTTP version not supported
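To make the message format concrete, here is a minimal sketch that sends a raw HTTP/1.1 `GET` over a TCP socket and prints the status line and headers; the host name `example.com` is just a placeholder.

```python
import socket

host, path = "example.com", "/"

request = (
    f"GET {path} HTTP/1.1\r\n"   # request line: method, URL, version
    f"Host: {host}\r\n"          # header lines
    "Connection: close\r\n"
    "\r\n"                       # blank line ends the header section
)

with socket.create_connection((host, 80)) as sock:
    sock.sendall(request.encode("ascii"))
    response = b""
    while chunk := sock.recv(4096):   # read until the server closes the connection
        response += chunk

headers, _, body = response.partition(b"\r\n\r\n")
print(headers.decode("ascii", errors="replace"))
```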
Cookies: User-Server State
HTTP is stateless, so to identify users cookies are used. There are four components:
- Cookie header line in the HTTP response message
  - Multiple `Set-cookie` lines can be sent
  - The header format is `Set-cookie: cookie_name=cookie_value`
    - By default, cookies are session cookies which are deleted after the current session ends
    - `; Expires=DoW, dd mmm yyyy hh:mm:ss GMT` can be appended to make it a persistent cookie
    - `; Secure` ensures it is only sent over HTTPS connections
    - `; HttpOnly` ensures it is not accessible through JS
- Cookie header line in the HTTP request message
  - `Cookie: cookie1_name=cookie1_value; cookie2_name=cookie2_value...`
- Cookie file kept on the client and managed by the browser
- Server's database storing cookies
Web Caches (Proxy Servers)
Proxy servers sit between the origin server and client. There can be multiple proxy servers, one of which will hopefully be closer to the client than the origin server.
The proxy server/cache acts as both a client and server. It helps to:
- Reduce response time
- Reduce traffic on an institution’s access link
- Lower bandwidth costs
If the requested object is in the browser's cache, the browser uses it directly. Otherwise, the browser requests it from the proxy server (and if the proxy does not have the object, the proxy in turn requests it from the origin server).
Conditional GET
The server will not send the object if the cache has an up-to-date copy of the object.
Request header: If-modified-since: some_date.
If the object has not been modified since the given date, a 304 Not Modified response will be sent and there will be no body. Otherwise, it will generate a normal 200 OK response.
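A sketch of a conditional GET using the standard library; the host and the `If-Modified-Since` date are placeholders standing in for a cached copy and its timestamp.

```python
import http.client

conn = http.client.HTTPConnection("example.com", 80)
conn.request("GET", "/", headers={
    "If-Modified-Since": "Sat, 01 Aug 2020 00:00:00 GMT",  # assumed cache timestamp
})
resp = conn.getresponse()
print(resp.status)   # 304 if not modified (empty body), otherwise 200 with the object
body = resp.read()   # b"" when the server answered 304
conn.close()
```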
17. Email and DNS
Three major components:
- User agent
- Agent through which the user accesses and sends email
- Mail server
- Has a mailbox containing incoming messages
- Has a message queue of outgoing messages to send
- SMTP
- Protocol used to exchange information between mail servers
- Mail servers run both an SMTP client and an SMTP server
Simple Mail Transfer Protocol (SMTP)
The standard protocol used to transfer email between servers.
Basic operation:
- A uses a user agent (UA) to compose a message
- A's UA sends the message to A's mail server, where it is placed in a message queue
  - Called email submission
- A's mail server's SMTP client opens a TCP connection with B's mail server's SMTP server
  - If multiple messages need to be sent, they are sent over a persistent TCP connection
- B's mail server places the message in B's mailbox
- B uses their UA to read the message
SMTP runs over TCP on port 25 by default; the exchange is plain text, so it can be driven interactively via telnet.
Sample telnet interaction:
- TCP connection established; server sends a three-digit reply code (probably `220`) and information about the server (host name, server name/version, date/time etc.)
- Client sends `HELO client_host_name`; server responds (with `250` if valid)
- Client sends `MAIL FROM: <user@client_host_name>`; server confirms validity
- Client sends `RCPT TO: <user@server_host_name>`; server confirms validity
- Client sends `DATA`; server responds with the delimiter used to end the email
- Client sends the body content, ending it with the specified delimiter
  - Body content typically restricted to 7-bit ASCII
- Server responds with a status message
- Client sends `QUIT`
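In practice the dialogue above is usually produced by a library rather than typed by hand. A minimal sketch with Python's standard library, using placeholder addresses and a placeholder mail server name:

```python
import smtplib
from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = "alice@client.example"
msg["To"] = "bob@server.example"
msg["Subject"] = "Hello"
msg.set_content("Sent over SMTP")

# smtplib performs the HELO/EHLO, MAIL FROM, RCPT TO, DATA and QUIT exchange shown above.
with smtplib.SMTP("mail.server.example", 25) as server:
    server.send_message(msg)
```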
Mail message format:
From: user@client_host_name
To: user@server_host_name
Subject: subject_text
[BLANK LINE]
message_body
Each mail server the message passes through prepends a Received: header line to the message, so the headers record the servers the email passed through.
Multipurpose Internet Mail Extension (MIME)
For non-ASCII data, MIME is used. It specifies the type of content (e.g. jpeg, mp3) and the method in which it is encoded. For example:
...
Subject: subject_text
MIME-Version: 1.0
Content-Transfer-Encoding: base64
Content-Type: image/jpeg
[BLANK LINE]
base64_encoded data
In base64 encoding, every three octets (24 bits) are divided into four groups of 6 bits each - small enough to fit into a printable ASCII character. The characters `A-Za-z0-9+/` are used, with `A` corresponding to 0 and `/` to 63. `=` is used as padding.
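A quick check of the 3-octets-to-4-characters mapping using the standard library:

```python
import base64

# Three octets (24 bits) become four 6-bit groups, i.e. four ASCII characters.
print(base64.b64encode(b"Man"))   # b'TWFu'
# '=' padding appears when the input is not a multiple of three bytes.
print(base64.b64encode(b"Ma"))    # b'TWE='
print(base64.b64decode(b"TWFu"))  # b'Man'
```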
SMTP and Mail Access Protocols
Access protocols are used for communication between the receiver’s mail server and user agent. Two mail protocols are:
- POP3 (Post Office Protocol):
- After authorization, the message is simply downloaded
- IMAP (Internet Mail Access Protocol):
- More features
- Allows messages stored on the server to be manipulated
Web-based email is another alternative; the user agent uses HTTP to communicate with its remote mailbox.
POP3
Runs over TCP on port 110; like SMTP it is a text protocol, so it can be driven via telnet.
During the authorization phase, the client sends two commands:
- `user username`
- `pass password`
After each command is sent, the server responds with either +OK or -ERR.
During the transaction phase, the client can send:
- `list`: list message numbers
- `retr mail_num`: retrieve the message with that number
- `dele mail_num`: delete the message with that number
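A sketch of the same exchange with Python's `poplib`; the server name and credentials are placeholders.

```python
import poplib

mailbox = poplib.POP3("pop.server.example", 110)
mailbox.user("username")       # authorization phase
mailbox.pass_("password")

count, size = mailbox.stat()   # number of messages and total size in bytes
for i in range(1, count + 1):
    response, lines, octets = mailbox.retr(i)   # like "retr i"
    print(b"\r\n".join(lines).decode("utf-8", errors="replace"))

mailbox.quit()
```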
IMAP
IMAP:
- Keeps all messages in the server
- Allows the user to organize messages in folders
- Keeps state across sessions
- Allows the user to obtain components of messages instead of the entire message
Domain Name System (DNS)
DNS is:
- a directory service
- a distributed database implemented in a hierarchy of DNS servers
- an application-layer protocol (UDP on port 53)
Some DNS services:
- Hostname to IP address translation
  - On Linux, programs can do this via the `gethostbyname()` library call
- Host aliasing
  - Canonical names (`CNAME` records) and alias names (e.g. `www.`)
- Mail server aliasing (`MX` records)
- Load distribution
  - A reply to a DNS request can return a list of IP addresses. With DNS rotation the list is rotated, and as clients typically pick the first address, the load can be distributed over multiple servers
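From a program's point of view, the translation service is just a library call; a small sketch (the host name is only an example):

```python
import socket

# One IPv4 address for the name.
print(socket.gethostbyname("www.google.com"))

# getaddrinfo may return several addresses, which supports load distribution.
for family, _, _, _, sockaddr in socket.getaddrinfo("www.google.com", 80, type=socket.SOCK_STREAM):
    print(family, sockaddr)
```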
DNS is not centralized, because a single central server would:
- Be a single point of failure
- Have to handle an enormous traffic volume
- Be far away from most hosts (latency)
- Be almost impossible to maintain
Hierarchy
DNS is composed of three main levels of DNS servers:
- Root DNS servers
- Top-level domain (TLD) servers
  - `com`, `org` etc.
- Authoritative DNS servers
- May be provided by organization or service provider
- Non-authoritative DNS servers will cache responses from the latter, so may not be accurate
If a client wants the IP address for www.google.com, to a first approximation, it will:
- Query a root server for the `com` DNS server
- Query the `com` DNS server to get the `google.com` DNS server
- Query the `google.com` DNS server for the IP address of `www.google.com`
There are 13 root name servers worldwide (although each is a cluster of replicated servers).
Local name servers do not belong to the hierarchy, but are central to the DNS architecture. Each ISP provides one, which acts as the default name server.
Query Types and Caching
If a client wants the IP address for www.google.com through an iterated query:
- The client sends a query to the local DNS server
- The local DNS server sends a query to the root DNS server for the address of the `com` DNS server
- The local DNS server sends a query to the `com` DNS server for the address of the authoritative DNS server for the `google.com` host
- The local DNS server sends a query to the authoritative DNS server for the `google.com` host for the IP address of `www.google.com`
- The local DNS server forwards the reply back to the client
A recursive query puts the burden of name resolution to the contacted name server:
- The client sends a query to the local DNS server
- The local DNS server sends a query to the root DNS server for the IP address of `www.google.com`
- The root DNS server sends a query to the `com` DNS server for the IP address of `www.google.com`
- The `com` DNS server sends a query to the authoritative DNS server for the `google.com` host for the IP address of `www.google.com`
- The response recurses back through the same chain of servers to the client
Both queries require 8 messages. Hence, caching is used:
- Once any name server learns a mapping, it caches it
- TTL ensures cache entries expire
Hence, root name servers are usually not consulted.
DNS records
Each resource record stores a name, value, type, TTL and class (although the class is always `IN`, internet).
Some type values:
- `A`: `name` is a hostname; `value` is its IP address
  - IPv6 uses the type `AAAA`, as IPv6 addresses use four times as many bits
- `NS`: `name` is a domain, `value` is the hostname of the authoritative name server for the domain
- `CNAME`: `name` is the alias name for the canonical name stored in `value`
- `MX`: `value` is the canonical name of the mail server associated with the alias name
DNS Messages
Both query and reply messages use the same message format:
| 16 bits | 16 bits |
| Identification | Flags |
| Num. questions | Num. answer RRs |
| Num. authority RRs | Num. additional RRs |
| Questions (variable num.) |
| Answers (variable num. RRs) |
| Authority (variable num. RRs) |
| Additional info (variable num. RRs) |
The query and reply to the corresponding query have the same identification number.
The flags indicate:
- Whether the message is a query (`0`) or reply (`1`)
- Whether the query is recursive or not
- In a reply, whether the server supports recursion
The additional fields:
- Questions (`(Name, Type)`)
  - The reply repeats these
- Answers (`(Type, Value, TTL)` RRs)
- Authority: RRs for authoritative servers
- Additional information (e.g. the IP address of the authoritative DNS server)
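A sketch that builds a minimal query in this format and sends it over UDP; the query name and the resolver address `8.8.8.8` are only examples.

```python
import socket
import struct

def build_query(name, query_id=0x1234):
    flags = 0x0100                                    # standard query, recursion desired
    header = struct.pack("!HHHHHH", query_id, flags, 1, 0, 0, 0)
    qname = b"".join(bytes([len(label)]) + label.encode("ascii")
                     for label in name.split(".")) + b"\x00"
    question = qname + struct.pack("!HH", 1, 1)       # QTYPE = A, QCLASS = IN
    return header + question

with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
    sock.settimeout(2.0)
    sock.sendto(build_query("www.google.com"), ("8.8.8.8", 53))
    reply, _ = sock.recvfrom(512)

reply_id, flags, questions, answers, authority, additional = struct.unpack("!HHHHHH", reply[:12])
print(hex(reply_id), answers, "answer RRs")           # same identification as the query
```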
Adding New Records
Example: registering a domain name at a registrar:
- Need to provide registrar with host names and IP addresses of primary/secondary authoritative DNS servers
- Registrar inserts two RRs into the TLD server for each authoritative name server
  - `(host_name, authoritative_DNS_server_host_name, NS)`
  - `(authoritative_DNS_server_host_name, authoritative_DNS_server_IP_address, A)`
- Go to the authoritative name server and insert records pointing to your server there
Mid-Term Test Notes
Sockets
IP: stateless, unacknowledged, unreliable, unordered.
TCP/UDP add ports on top of IPv4 for addressing.
TCP/Stream:
- Acknowledgements of failure
- Record boundaries not preserved
UDP/Datagram:
- One `write`, one packet; one `read`, one packet
API:
- `bind`: bind the socket to an IP address/port and address family
- `listen`: declare willingness to accept incoming requests
- `accept`: waits until a request is received; returns a new socket
connect- TCP: request establishment of connection
- UDP: default receiver when using
send/write
read/recvfrom: get received data. Blocking if no datawrite/sendto: send data. Blocking if buffer is fullsendto/recvfromspecifies port/address family of sender/receiverclose: free resources
Network byte order: big endian.
Protocol Layering
Service access point: interface to service for higher layer.
| Header | Higher-layer payload/service data unit | Trailer |
OSI
Physical/Link layers have single-hop scope.
The network layer uses single hops to achieve end-to-end communication.
Routers implement the physical, link and network layers, which are intentionally kept simple (thus allowing hardware implementations for high performance).
Physical:
- Bits to modulation of signal
- Specifies signal, modulation, bit synchronization etc.
Link:
- MAC: flow control etc.
- LLC: frames, error detection/coding, retransmission
Network:
- Addressing and routing; end-to-end delivery of messages
- Specifies addressing format, exchange of routing information, possibly establishment etc. of connections
Transport:
- Reliable, in-sequence, transparent end-to-end transfer
- Error control to guard against routers dropping packets
- Congestion control (stopping network overload, fair resource allocation) and end-to-end flow control (stopping receiver overload)
Session:
- Communication sessions between applications
- May involve several transport layer connections
- Controls duplex/half-duplex/simplex operation
Presentation:
- Endian conversion
Application:
- High-level support functions e.g. HTTP
TCP/IP reference model: physical/network interface/internet/transport/application.
Transport:
- End-to-end communication using the socket interface
- SAPs called ports; multiplexing over a single network adapter
- PDUs generated by this layer called segments
Service Primitives
Confirmed service:
- User A sends `request` (possibly with data)
- Provider A generates PDU(s)
- Provider B receives the PDU(s), sends `indication` to user B
- User B generates `response` (possibly with data)
- Provider B generates a response PDU
- Provider A receives the response, sends `confirm` to user A
Unconfirmed: no response or confirm e.g. UDP.
Confirmed Delivery: Provider B sends the confirmation PDU as soon as the request PDU is received, regardless of what user B does.
Multiplexing:
- Several N-layer SAPs transmitting data over a single (N-1)-layer SAP
- Scheduling to determine which SAP to serve
- Requires identifier (e.g. port numbers)
Splitting:
- Single N entity using several N-1 SAPs
- Scheduling to determine which SAP to use for a given PDU
- May require sequencing mechanisms
Fragmentation:
- When PDU is too large for the lower-layer protocols; partition into fragments
- Numbering of fragments
Blocking:
- Buffering small SDUs into a single PDU
- Markers required to separate SDUs
Physical Layer
- Formatting: ADC - source signal into digital data
- Source coding/compression: lossy coding (relevancy reduction) and lossless coding (redundancy reduction)
- Channel coding: adding redundant information for error correction - reduces bit-error probability
- Modulator: maps bits to physical signal/waveform of a certain duration - symbol time
- Baseband transmission. Bipolar NRZ: `1 = +x V`, `0 = -x V`. Unipolar NRZ: `1 = x V`, `0 = 0 V`
- Attenuation: signal power at the transmitter divided by signal power at the receiver; lower is better
- Thermal noise: normal distribution added to signal
Passband transmission:
The transmitted symbol is `s(t) = A(t) cos(2π (f_c + f(t)) t + φ(t))` for `0 ≤ t < T`, where T is the symbol duration, A(t) is the amplitude, f(t) is the frequency offset, f_c is the centre frequency and φ(t) is the phase.
- Amplitude shift keying: varies `A(t)` for each symbol
- Frequency shift keying: varies `f(t)` for each symbol
- Phase shift keying: varies `φ(t)` for each symbol
Quadrature Amplitude Modulation: varies amplitude and phase. n unique amplitudes and n unique phases for n^2 symbols. Greater throughput but higher error rate.
Carrier synchronization: synchronizes frequency and phase.
NRZ can lead to long runs of the same bit and loss of synchronization.
Manchester encoding: [-x, x] voltage range:
- 1: high to low in the middle of the symbol duration
- 0: low to high in the middle of the symbol duration
Frame synchronization, ethernet: preamble at start for carrier synchronization, then start-frame-delimiter.
End of frame detection: gap between frames, deliberate code violations, dedicated length fields, special flags (with escape codes).
LAN
Geometric RV:
-
Type 1, Bernoulli trial: returns probability of k failures before success:
`P(k) = p(1-p)^k`; expected value `(1-p)/p`
Type 2, probability of k trials before first success:
P(k)=p(1-p)^(k-1), expected value1/p -
Bus: tap line attaches all stations to bus. Broadcast medium
-
Star: private full-duplex channel for each station. Central un it is a repeater or bridge/switch
-
MAC: multiple stations, shared channel usable by one station at a time, no other channels, broadcast medium
-
Want: small medium access delay, fairness, reuse of resources, low overhead, stability
Orthogonal: behavior of one station independent of another.
Frequency division multiple access:
- Bandwidth divided into sub-channels + guard channels
- Separate receivers/tunable receiver - latter requires switching before transmission starts
- 0 medium access delay, B/n bandwidth
Time:
- Full bandwidth for 1/n of the time. Avg MAD is half superframe time plus B/n
- Avg better than FDMA for n>2
- Requires time synchronization
Both require some way of allocating sub-channels/time slots.
Random access protocol: no central station/shared state.
ALOHA:
- Transmit packet + checksum immediately
- Receiver sends immediate ack
- If timer expires, random backoff time chosen; repeat until threshold reached
- Vulnerability period: a frame can collide with any frame whose transmission starts up to one frame time before it or during its own transmission (two frame times in total for equal-sized frames)
- Unstable for high loads due to increasing collision rates
CSMA - carrier sense multiple access. Carrier sense: almost instantaneous check of if the medium is busy. Fails if time difference between two stations starting collisions is smaller than the propagation delay.
Non-persistent CSMA:
- If medium busy, generate random backoff time, then check again
- High probability of medium being idle after transmission finishes
p-persistent CSMA:
- Time is divided into very small frames
- Once medium is free, probability p of starting transmission. If collision occurs, restart
- If defers, repeats the process the next time slot
- Low p leads to stability in high load but low throughput in low load
1-persistent CSMA, as used by CSMA/CD:
- Send unconditionally at end of transmission: avoid idle time
- Detect collision immediately if multiple stations transmit
- Abort transmissions
- Send jamming signal
- Begin collision resolution
Ethernet
Uses Manchester encoding. Frame format (field lengths in bytes):
| Length | Name |
|---|---|
| 7 | Preamble |
| 1 | SOF |
| 6 | DstAddr |
| 6 | SrcAddr |
| 2 | Length/Type |
| 46-1500 | Payload |
| 4 | FCS |
MAC address: 48 bits, globally unique. If the first bit of the address is 1, it is a multicast address; otherwise it is a unicast address.
If the length/type field value is greater than 1500, it is a type field (e.g. IPv4, ARP), allowing protocol multiplexing. Otherwise it is a length field, and the type is assumed to be the first two bytes of the payload.
Broadcast: half-duplex, MAC needed, 10 Mbps PHYs.
MAC protocol: CSMA/CD:
-
`coll` set to zero
Carrier-sense; once medium idle, transmission starts immediately
-
If collision detected
- Abort, send jamming signal, increment
coll - Drop packet if
coll > 16 - Wait for backoff time
- Backoff window:
[0, 2^{min(10, coll) - 1}] - Multiply by slot time; predefined time large enough for max round-trip time
- Backoff window:
- Abort, send jamming signal, increment
-
Minimum frame size: transmitter can only detect collision while it is still transmitting by measuring voltage on medium
-
Repeater: amplify signal on analog level
-
Regenerating repeater: demodulate then modulate on a symbol-by-symbol level. No error checking
-
Hub: centralized repeater; broadcast signals to all ports. Basically a bus
-
Bridge: interconnects LANs on the MAC layer - must have the same MAC address structure
- Peeks at source address to learn what stations are on what LANs
- If destination address unknown, broadcasts packet to all other LANs
-
Switches: private full-duplex links for all stations. Frames transmitted to only the correct port - allows parallel transmissions
IP
Big endian byte ordering
| Length | Name |
|---|---|
| Bytes 0-3 | |
| 4 | Version (=4) |
| 4 | Hdr Len (numBytes/4) |
| 6 | TOS/DSCP (zeros) |
| 2 | Unused (zeros) |
| 16 | TotalLength (bytes) |
| Bytes 4-7 | |
| 16 | Identification |
| 3 | Flags (0, DF, MF) |
| 13 | FragmentOffset (offset/8) |
| Bytes 8-11 | |
| 8 | Time-To-Live (usually 32/64) |
| 8 | Protocol Type |
| 16 | Header Checksum |
| Bytes 12-15 | |
| 32 | Source Address |
| Bytes 16-19 | |
| 32 | Destination Address |
| Bytes 20+ | |
| 32n | Options + Padding |
| Data |
Last fragment has MF unset, and non-zero FragmentOffset to differentiate from an unfragmented packet.
Protocol type: determines higher-level protocol that generated the payload (e.g. UDP/TCP).
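A sketch that unpacks the fixed 20-byte header described in the table; the sample bytes below are fabricated for illustration.

```python
import struct

def parse_ipv4_header(packet):
    (version_ihl, tos, total_length, ident, flags_frag,
     ttl, proto, checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", packet[:20])
    return {
        "version": version_ihl >> 4,
        "header_len_bytes": (version_ihl & 0x0F) * 4,
        "total_length": total_length,
        "flags": flags_frag >> 13,
        "fragment_offset": (flags_frag & 0x1FFF) * 8,   # stored as offset/8
        "ttl": ttl,
        "protocol": proto,                              # e.g. 6 = TCP, 17 = UDP
        "src": ".".join(str(b) for b in src),
        "dst": ".".join(str(b) for b in dst),
    }

# Fabricated header: version 4, IHL 5, TTL 64, protocol 17 (UDP), 192.168.1.1 -> 192.168.1.2.
sample = bytes.fromhex("450000540000400040110000c0a80101c0a80102")
print(parse_ipv4_header(sample))
```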
Routing table: map from destination address to output port.
Host address of all 0s: network as a whole being referred to. All 1s: broadcast address.
Lookup:
- Check for full host address match
- Check for network address match
- Use default
- Drop packet, send ICMP message
End hosts have default route + entry for each network it is directly connected to.
Core routers: know almost all internet networks, have no default routes.
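A sketch of the lookup order over a made-up forwarding table, using the standard `ipaddress` module; the addresses and port names are placeholders.

```python
import ipaddress

table = {
    "192.168.1.7/32": "eth0",   # full host address match
    "192.168.1.0/24": "eth0",   # directly connected network
    "0.0.0.0/0":      "eth1",   # default route
}

def lookup(destination):
    dst = ipaddress.ip_address(destination)
    best = None
    for prefix, port in table.items():
        net = ipaddress.ip_network(prefix)
        if dst in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, port)          # most specific matching entry wins
    return best[1] if best else None    # None: drop and send ICMP message

print(lookup("192.168.1.7"))   # eth0
print(lookup("8.8.8.8"))       # eth1 via the default route
```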
Address Resolution Protocol: map from IP to MAC:
- Each has ARP cache, entries discarded fixed time after insert (or last use)
- Broadcast message containing own IP/MAC + IP of target
- Target responds with own IP/MAC plus requester IP/MAC
Internet Control Message Protocol - optional protocol used to inform sender of error.
Test Notes
Distance-Vector
Periodically send the distance vector (current least-cost estimates to all destinations) to all neighbours; update own estimates whenever a neighbour's vector or a link cost changes.
Poisoned reverse: if a node routes to a destination through a neighbour, it advertises an infinite distance for that destination to that neighbour (helps avoid count-to-infinity loops).
RIP
-
Metric: number of hops (number of subnets traversed)
-
Max path cost: 15
-
DVs sent to neighbors every 30 seconds via UDP
- Each advertisement contains at max 25 entries
-
Table stores destination network, next hop and number of hops
-
After 3 minutes of no advertisements, node declared dead
Link-State (Dijkstra)
Each router reliably floods information about its neighbors and independently calculates the least-cost path to every other router.
Repeatedly add the nearest node not yet in the tree to the tree, then update the distance to each of that node's neighbours via the new node.
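A compact sketch of that computation, over a made-up graph given as `{node: {neighbour: link cost}}`:

```python
import heapq

def dijkstra(graph, source):
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue                        # stale entry, node already settled
        for neighbour, cost in graph[node].items():
            new_dist = d + cost
            if new_dist < dist.get(neighbour, float("inf")):
                dist[neighbour] = new_dist
                prev[neighbour] = node      # predecessor on the least-cost path
                heapq.heappush(heap, (new_dist, neighbour))
    return dist, prev

graph = {"u": {"v": 2, "w": 5}, "v": {"u": 2, "w": 1}, "w": {"u": 5, "v": 1}}
print(dijkstra(graph, "u"))   # ({'u': 0, 'v': 2, 'w': 3}, {'v': 'u', 'w': 'v'})
```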
OSPF
- Authenticated advertisements over IP (not UDP/TCP)
- Multiple same-cost paths allowed
- Link cost can be set by administrators
- Uni/multi/broadcast support
- Used mostly in upper-tier ISPs
Hierarchy
- Can be configured into areas. Each area runs OSPF:
- Broadcasts only within the same area
- Each node has detailed topology about its own area
- Inter-area routing through area-border routers
- ‘Summarizes’ distance to networks in its own area
- One backbone area which connects all areas
- Boundary routers connect to other ASes
IPv6
- Version (4): 6
- Traffic class (8): QoS
- Flow label (20): identifier for a group of packets
- Payload length (16): size of data in bytes
- Next header (8): payload type (e.g. UDP, TCP)
- Hop limit (8): TTL
- Source/Destination addresses: 128 bits each
When forwarding packet through IPv4 router:
- Dual stack: transform to IPv4, lose metadata
- Tunnelling: encapsulate the IPv6 packet as the payload of an IPv4 packet
BGP
Pairs of routers, either in the same (iBGP) or different AS (eBGP), exchange information over TCP. If a node advertises a prefix, it promises that it can forward datagrams to the prefix. Some attributes:
- `AS_PATH`: not updated on iBGP connections. Contains the list of ASes through which the advertisement has passed; reject if own AS is already in the path
- `NEXT_HOP`: not updated on iBGP connections. Advertisements flow from the destination outwards, so it contains the address of the boundary router for the next-hop AS
Import policy used to accept/reject advertisements.
Route selection uses local preference (policy decision), shortest AS_PATH, closest NEXT_HOP router etc.
Reliable Data Transfer
Internet checksum:
def ones_complement_addition(shorts):
    # Sum 16-bit words; when a carry comes out of the top bit, wrap it back around.
    result = 0
    for short in shorts:
        result += short
        if result > 0xFFFF:
            result -= 0xFFFF
    return result

def checksum(shorts):
    # The checksum is the ones' complement of the ones' complement sum.
    return 0xFFFF - ones_complement_addition(shorts)

def check_checksum(shorts, received_checksum):
    # Summing the received words together with the checksum should give all ones.
    return ones_complement_addition(list(shorts) + [received_checksum]) == 0xFFFF
CRC
- `d` data bits; `r` checksum bits appended to the end
- Agreed-upon `r+1`-bit generator pattern, with the MSB being `1`
- To find the `r` checksum bits: append `r` zero bits to the data, do long division (XOR) by the generator, and take the remainder
- Can detect consecutive (burst) bit errors up to `r` bits long
RDT
1.0: assumes no bit errors, buffer overflows, out-of-order packets or packet loss
2.0: bit errors; checksum. Stop and wait for ACK/NACK after each packet
2.1: 1-bit sequence numbers in case ACK/NACK corrupted
2.2: No NACK. Sends ACK with sequence number - last packet that was received successfully
3.0: bit errors and lost packets. Assumes packets are not reordered in the channel (although this can be solved by having a time to live and sufficiently large sequence space)
Go-Back-N
No more than N unacknowledged packets. The receiver sends cumulative ACKs. On timeout, all unacknowledged packets in the window are resent.
Selective Repeat
An ACK for each packet.
On timeout, resend oldest packet without ACK. If duplicate packet received, send ACK for that packet again.
Worst case scenario (window size N, sequence numbers starting at 0):
- All but the first packet in the window arrive; their ACKs reach the sender
- The first packet is resent after a timeout and received, but its ACK is lost
- The receiver window moves forward by a full window; the sender window does not move
  - Sender window is still `[0, N-1]`
  - Receiver window is now `[N, 2N-1]`
- The sender times out and sends the first packet (sequence number 0) again
To prevent this retransmission from being interpreted as a new packet, the sender's base sequence number must not fall inside the receiver window. Hence, the sequence number space must be at least twice the window size.
UDP and TCP
UDP
| 16 bits | 16 bits |
| Source port | Destination port |
| Length | Checksum |
| Data |
Identified by destination port and IP.
TCP
| 16 bits | 16 bits |
| Source port | Destination port |
| Sequence Number |
| Acknowledgement Number |
| Flags | Receive window |
| Checksum | Urg. data pointer |
| Options (variable) |
| Application Data |
- Sequence number: byte number for first byte in segment
- Acknowledgement: next expected byte number
- Flags starts with 4-bit header length - bytes in header divided by 4
LastByteSent - LastByteACKed <= min(CongestionWindow, RecvWindow)
Handshake and Termination
Segments with SYN or FIN treated as if they have a 1 byte payload.
Three-way handshake:
- Client sends TCP SYN segment
- Server replies with SYNACK: `ack = client_isn + 1`
- Client replies with ACK: `ack = server_isn + 1`, `sequence_number = client_isn + 1`
Termination:
- Client sends `FIN`
- Server responds with `ACK`
- Once the connection is closed, the server sends `FIN`
- Client responds with `ACK`, waiting twice the maximum segment lifetime to ensure the ACK arrived
Acknowledgements
ACK: largest byte received (cumulative ACK).
Sender events:
- One retransmission timer (for the oldest unacknowledged segment)
- If timer expires, resend oldest unacknowledged packet
- Double timeout and restart timer
- When ACK received, restart timer (if received ACK is not for base packet)
- Fast retransmit: if three ACKs with the same segment number received, retransmit it immediately
Receiver events:
- If an in-order segment arrives, wait up to 500 ms for another; if a second in-order segment arrives in that window, send a single cumulative ACK immediately
- If out-of-order packet arrives, buffer and send duplicate ACK
Timeout value requires estimating the round-trip time with an exponential weighted moving average: `EstimatedRTT = (1 - a) * EstimatedRTT + a * SampleRTT` (typically `a = 0.125`), `DevRTT = (1 - b) * DevRTT + b * |SampleRTT - EstimatedRTT|` (typically `b = 0.25`), and `TimeoutInterval = EstimatedRTT + 4 * DevRTT`.
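A small sketch of those updates, with the usual example weights and made-up RTT samples in seconds:

```python
def update_rtt(estimated_rtt, dev_rtt, sample_rtt, alpha=0.125, beta=0.25):
    estimated_rtt = (1 - alpha) * estimated_rtt + alpha * sample_rtt
    dev_rtt = (1 - beta) * dev_rtt + beta * abs(sample_rtt - estimated_rtt)
    timeout = estimated_rtt + 4 * dev_rtt
    return estimated_rtt, dev_rtt, timeout

est, dev = 0.10, 0.01                      # assumed starting values
for sample in (0.12, 0.09, 0.20):
    est, dev, timeout = update_rtt(est, dev, sample)
    print(round(est, 4), round(dev, 4), round(timeout, 4))
```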
Flow control
- Receiver advertises its receive window - the amount of free space in its buffer
- If the receive window is 0, the sender must periodically send small probe segments so it learns when buffer space becomes available
Congestion Control
Congestion shows up as overflowing router buffers (packet loss) and growing queueing delays; congestion control limits the sending rate to avoid this.
Slow start:
- Congestion window = 1 MSS (1 segment per RTT)
- Double window every RTT (increment by one after every ACK)
AIMD (after first loss):
- Increase by 1 after every RTT (window += 1/window)
- Halve after every loss
After 3 duplicate ACKs, halve the window and move to AIMD.
After a timeout, slow start until the window reaches half of its size before the timeout, then move to AIMD.
LastByteSent - LastByteAcked <= min(CongestionWindow, ReceiveWindow)
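A toy sketch of how the congestion window (in MSS) evolves under slow start and AIMD with a made-up loss pattern; real TCP reacts to individual ACKs and timers rather than whole RTTs.

```python
cwnd, ssthresh = 1.0, 64.0
loss_pattern = [False] * 6 + [True] + [False] * 5   # triple duplicate ACK at RTT 6

for rtt, triple_dup_ack in enumerate(loss_pattern):
    if triple_dup_ack:
        ssthresh = cwnd / 2        # remember half the window at the loss...
        cwnd = ssthresh            # ...and continue from there in AIMD
    elif cwnd < ssthresh:
        cwnd *= 2                  # slow start: double every RTT
    else:
        cwnd += 1                  # AIMD: +1 MSS per RTT
    print(rtt, round(cwnd, 1))
```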
Lab 4
- First ping to station on the same network slow as ARP request required
- Minimal forwarding table: `0.0.0.0/0` pointing to the default gateway
Ping failure modes:
- Network unreachable: no entry in forwarding table
- Destination host unreachable: directly-attached network, ARP fails
- Destination port not reachable: TCP/UDP packet, port not bound
- No response: if routing table of receiver empty
ASes:
- Stub: leaf node connected to single AS
- Multihomed: connected to multiple ASes
- Transit: multihomed, and traffic from other ASes can pass through it