08. Reliability, Resilience and Security

Ethics

Equality: same treatment for everybody

Equity: customized treatment to ensure everyone has the same opportunity

Algorithms and AI: garbage in, garbage out. If the dataset a model is fed is biased, its output will be biased too.

ACM Code of Ethics. TL;DR: respect everyone, and when you make mistakes, reflect on them.

Faults, errors and failures:

To improve reliability:

Availability and reliability:

Works-as-designed problem:

Reliability can be subjective, affecting only a subset of users:

Capacity management:

Architectural strategies:

Protection systems
- Systems that monitor the execution of others
- Trigger alarms or automatically correct the behavior
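
A protection system can be as small as a wrapper that checks another component's output. The sketch below is only an illustration: the class name `RangeMonitor`, the limits, and the clamping policy are assumptions, not something the notes prescribe. It raises an alarm through a caller-supplied callback and corrects the value by clamping it into the allowed range.

```python
from typing import Callable


class RangeMonitor:
    """Watches readings from another component (hypothetical example)."""

    def __init__(self, low: float, high: float, alarm: Callable[[str], None]) -> None:
        self.low = low
        self.high = high
        self.alarm = alarm  # called with a message when a reading is out of range

    def check(self, reading: float) -> float:
        """Return the reading unchanged if valid, else alarm and clamp it."""
        if self.low <= reading <= self.high:
            return reading
        self.alarm(f"reading {reading} outside [{self.low}, {self.high}]")
        # Automatic correction: force the value back into the safe range.
        return min(max(reading, self.low), self.high)
```

Whether clamping is a safe correction depends on the domain; in some systems the right reaction is to shut down instead.
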
Multiversion programming
- Concurrent computation
  - Hardware with different items/providers
  - Software with different development teams
  - Voting systems e.g. triple-modular-redundancy
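
The voting idea behind triple modular redundancy can be sketched in a few lines. This is a simplified illustration, assuming the versions return hashable, directly comparable values; the three `square_*` functions are hypothetical stand-ins for independently developed versions, and one of them is deliberately faulty.

```python
from collections import Counter
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def majority_vote(results: Sequence[T]) -> T:
    """Return the value reported by a strict majority of the versions."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority among versions")
    return value


# Hypothetical versions written by different teams.
def square_v1(x: int) -> int:
    return x * x


def square_v2(x: int) -> int:
    return x ** 2


def square_v3(x: int) -> int:
    return x * x + 1  # deliberately faulty version


def run_redundant(versions: Sequence[Callable[[int], int]], x: int) -> int:
    """Run every version on the same input and let the voter decide."""
    return majority_vote([version(x) for version in versions])
```

With these three versions, `run_redundant([square_v1, square_v2, square_v3], 4)` masks the single faulty result. If all versions disagree, the voter raises instead of guessing.
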

Visibility: need-to-know principle; if variables/methods don’t need to be exposed, don’t expose them
Validity:
- Check format and domain of input values (including boundaries)
- Use if statements, or assert statements that are enabled during regression testing
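
A format and domain check on an input value might look like the sketch below. The function name `parse_age` and the limits `MIN_AGE` and `MAX_AGE` are assumptions chosen for illustration. The format check runs first, then the boundary check.

```python
import re

MIN_AGE = 0    # assumed lower boundary (inclusive)
MAX_AGE = 150  # assumed upper boundary (inclusive)


def parse_age(raw: str) -> int:
    """Validate the format, then the domain, of an age given as text."""
    text = raw.strip()
    # Format check: one to three ASCII digits, nothing else.
    if re.fullmatch(r"[0-9]{1,3}", text) is None:
        raise ValueError("age must be a whole number")
    age = int(text)
    # Domain check, boundaries included.
    if not MIN_AGE <= age <= MAX_AGE:
        raise ValueError(f"age must be between {MIN_AGE} and {MAX_AGE}")
    return age
```

An explicit `if` is used here instead of `assert` because Python drops assert statements when run with `-O`, so asserts are better suited to internal invariants than to checking external input.
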
Avoid errors becoming system failures by capturing them; never send back an error message with the stack trace
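
One way to capture errors at the system boundary is sketched below. The `handle_request` function and the dictionary response shape are hypothetical, not tied to any framework. The full stack trace goes to the server log, while the caller only receives a generic message plus an incident id for support to look up.

```python
import logging
import uuid

logger = logging.getLogger("app")


def handle_request(handler, payload):
    """Run a handler and turn any uncaught error into a safe response."""
    try:
        return {"status": 200, "body": handler(payload)}
    except Exception:
        incident_id = uuid.uuid4().hex
        # logger.exception records the message together with the traceback.
        logger.exception("unhandled error, incident %s", incident_id)
        # The client sees no exception type, message, or stack trace.
        return {
            "status": 500,
            "body": {"error": "Internal error", "incident": incident_id},
        }
```
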
Erring:
- Avoid untyped languages
- Encapsulate ‘nasty’ stuff
Restart: provide recoverable milestones so that it can restart into a good state
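
Recoverable milestones can be sketched as checkpoints written after each unit of work. The example below is a minimal illustration with a hypothetical JSON state layout (`next_index`, `total`). The checkpoint is written to a temporary file and then renamed over the old one, so a crash in the middle of a save leaves the previous good checkpoint intact.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temp file, then swap it in atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        json.dump(state, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the last saved state, or the default on a first run."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def process(items, path):
    """Sum the items, resuming from the last checkpoint if one exists."""
    state = load_checkpoint(path, {"next_index": 0, "total": 0})
    for i in range(state["next_index"], len(items)):
        state["total"] += items[i]
        state["next_index"] = i + 1
        save_checkpoint(path, state)  # milestone after every item
    return state["total"]
```

Saving after every item is the simplest policy; a real system would trade checkpoint frequency against I/O cost.
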
Constants: express fixed or real-world values with meaningful names
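
As a small illustration, the timeout check below names its fixed values instead of embedding `1800` in the comparison. The constant names and the 30-minute policy are assumptions. `typing.Final` is only enforced by static type checkers, not at runtime.

```python
from typing import Final

SECONDS_PER_MINUTE: Final = 60
SESSION_TIMEOUT_MINUTES: Final = 30  # hypothetical policy value


def session_expired(idle_seconds: float) -> bool:
    """True once a session has been idle for longer than the timeout."""
    return idle_seconds > SESSION_TIMEOUT_MINUTES * SECONDS_PER_MINUTE
```
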

Unicorns:

Security threats come from:

Recognition: how an attacker may target an asset
Resistance: possible strategies to resist each threat
Recovery: plan data, software and hardware recovery procedures
- And test the restore procedure
Reinstatement: define the process to bring the system back
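
Testing the restore procedure, as noted under Recovery, can be automated as a drill. The sketch below assumes a single-file backup made by plain copying; the function names are hypothetical. It restores the backup into a scratch directory and compares SHA-256 checksums against the source.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Checksum of a file's full contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def backup(source: Path, backup_dir: Path) -> Path:
    """Copy the source file into the backup directory."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)
    return target


def restore_drill(source: Path, backup_file: Path) -> bool:
    """Restore into a scratch directory and verify it matches the source."""
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / backup_file.name
        shutil.copy2(backup_file, restored)
        return sha256_of(restored) == sha256_of(source)
```

A drill like this only proves the backup is readable and intact at the time it runs, so it is worth scheduling regularly.
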

Resilience planning:

Checklist: