04. AI Anomaly Detection

Background

Types of Anomaly Detection

Contextual
Collective
Online
Distributed
Point
- Classification
  - Rule
  - Neural networks
  - SVM
- Nearest Neighbor
  - Density
  - Distance
- Clustering
- Statistical
  - Parametric
  - Non-parametric
- Other
  - Information theory
  - Spectral decomposition
  - Visualization

Neural Networks

1990s:
- Neural networks with very few (e.g. 3) layers
- Too limited/sensitive, required lots of manual tweaking
- Linear classifier: cannot learn XOR function
Deep learning: hundreds of layers
- Accuracy high enough for commercial use
Convolutions neural networks
- Between layers, convolutions applied and sub-sampling occurs
ML vs DL:
- ML: feature extraction done by humans
- DL: relevant features automatically extracted from input
CNNs: repeat (convolution, sub-sampling) steps for a few layers, then form full connections and Gaussian connections before outputting results.

Introduction

Too much data, not enough analysis.

Anomalous events occur relatively infrequently, but when they do occur, their consequences can be devastating.

What are anomalies?

Patterns in the data that do not conform to the expected behavior
- Patterns/outliers/exceptions/peculiarities
Often translates to real-life entities
- Credit card fraud
- Cyber intrusions

Point Anomaly Detection

Classification-based Techniques

Main idea: build a classification model for normal/anomalous events based on labeled training data.

Models must be able to handle skewed/imbalanced class distributions: anomalous events are rare by definition.

Must be able to handle skewed/imbalanced class distributions: anomalous events are rare.

Categories:

Supervised classification
- Both normal and anomaly classes labeled
  - Subscribe to many sources to get examples of new anomalies as they appear in the world; hard to get good data set of anomalies within a single organization
- Build classifier to distinguish between normal and anomalous behavior
- High accuracy
- Cannot detect unknown and emerging anomalies
Semi-supervised classification
- Requires only labels from the normal class
- Detect deviations from normal behavior
- Possible high false alarm rate - previously unseen but legitimate data records may be incorrectly categorized

Techniques:

Manipulating data records
- Oversampling the rare class
  - Make duplicates of the rare class to balance out the two classes
  - Does not increase information but increases the mis-classification cost
- Undersampling the majority class
  - Techniques: random, near-miss examples, examples far from decision boundaries
  - Results in general information loss and overly general rules
- Generating artificial anomalies
  - Synthetic minority over-sampling techniques: generate rare class examples inside regions or near the edges of existing rare class examples
Rule-based
- Adapting existing rule-based techniques:
  - Adapting multi-class classification methods to a single-class classification
- Association rules
  - Rules with support higher than some pre-specified threshold may characterize normal behavior
- Case specific feature weighting
  - Basically any classifier will require you to set some sort of threshold
  - Decision tree learning: each rare class test examples replaces the global weight vector with a dynamically generated weight vector that depends on the path taken by that example
- Case specific rule weighting:
  - LERS: learning from examples based on rough sets; increases rule strength for rules describing the rare class
- PN-rule learning
  - P-phase:
    - Create rules that cover most positive examples with high support
    - Phase focuses on gaining high recall
  - N-phase
    - Add rules which remove the false positives
    - Phase focuses on high accuracy and support
- CREDOS
  - Graph theory approach: create decision tree
  - Ripple down rules: overfit the training data
    - Create binary tree with a rule on each node
    - Overfit the training data
    - Then prune the decision tree to generalize
Model-based:
- Deep learning/Neural networks
  - Multi-layer perceptrons
    - Measure activation of output nodes
    - Extend learning beyond decision boundaries
  - Auto-associative neural networks
    - Replicator neural network
      - Multi-layer, feed-forward neural network
      - Same number of input and output nodes; variables should match between input and output
      - Hidden layers have fewer nodes, forcing the RNN to come up with a compressed representation of the data
      - Measure outlyingness by checking the reconstruction error of each data point:
      - Model is designed to compress normal data, so anomalies will not be compressed as well
    - Hopfield networks
  - Radial basis functions
    - Can interpolate between sparse data points
    - Adds reverse connection from output to central layer
      - Each neuron has an associated normal distribution; any new instance that does not fit any of the distributions is an anomaly
- Support vector machines
  - Normal data records belong to high-density data regions
  - Use unsupervised approach to learn high/low density data regions
  - Use SVM to classify the data density level and hence detect anomalies
    - Tensor calculus
    - Transform data points into a different (higher-dimensional) space to maximize the distance between the two categories
    - Linear; uses hyperplanes for classifications
    - RBF: radial basis function kernel
      - Computes similarity between two points
      - $K(X_1, X_2) = e^{-\frac{\| X_1 - X_2 \}^2}{2\theta^2}}$
        
        $1$ when distance is 0
        
        $\theta$ is the variance and SVM hyperparameter
      - As variance decreases, number of support vectors increases, making the separating surface more complex
- Bayesian networks
  - Typical: aggregate information from different variables to estimate the probability that events belongs to the normal/anomalous class
  - Naive Bayesian classifiers
    - Incorporates prior probabilities to classify events as normal/anomalous
  - Psuedo-Bayes estimators
    - I stage: learn prior, posterior of unseen anomalies from training data
    - II stage: use naive Bayesian classifiers to classify instances into normal instances, known anomalies and new anomalies
  - Hidden Markov model
Cost-sensitive classification
- Thresholding based on possible cost to the organization
Ensemble:
- Boosting: learning algorithm that can increase the performance of a weak classifier

Nearest Neighbor

Approach:
- Compute neighborhood for each data record
- Analyze neighborhood to determine if the record is an anomaly
Nice and simple, and can be used in unsupervised or semi-supervised settings
May fail if normal points do not have a sufficient number of neighbors
Computationally expensive
As dimensionality increases, data becomes sparse and similarity/distance may not be meaningful
Assumes normal points are ‘close’ to neighbors
- Distance:
  - Point is an outlier if more than $p$ (proportion) of the $k$ nearest neighbours are more than $d$ distance away from the point
  - Not suitable for datasets that have modes with varying density
- Density:
  - Compute local densities in particular regions; declare instances in low-density regions as anomalies
- Local outlier factor (LOF)
  - For each data point $q$ :
  - Compute distance to the $k$ -th nearest neighbor
  - Compute reachability distance with respect to data example $p$ : $\text{reach-dist}(k, p) = \max{\text{k-distance}(p), d(p, q)}$
  - Compute local reachability density: inverse of average reachability based on $MinPts$ nearest neighbors of $q$ : $\text{lrd}(q) = \frac{MinPts}{\Sum{\text{reach-dist}_{MinPts}(q, p)}}$
  - Compute local outlier factor as ratio of average local reachability density of $q$ ’s $k$ nearest neighbors and the local reachability density of $q$ : $\text{LOF}(q) = frac{1}{MinPts} \Sum{p}{\frac{\text{lrd}(p)}{\text{lrd}(q)}}$
- Connectivity outlier factor (COF)
  - Points are connected to their closest neighbor(s?), and are connected to others by the connctivity of the points they are associated with; these recursive relationships form chains
  - Outliers have a greater average distance between points in a chain than the average chaining distance of their nearest neighbours
- Multi-granularity deviation factor (LOCI)
  - Neighbourhood size computed; outliers are points whose neighborhood size are significantly different with respect to their neighbours
  - Finds both outlying points and outlying micro-clusters

Clustering-based Techniques

Normal points belong to large and dense clusters; anomalies do not belong to any, or form very small clusters
Semi-supervised
- Cluster normal data to create modes of normal behavior; anomalies are instances that do not belong to any clusters
Unsupervised
- Post-processing required to determine cluster size and max distance for a point to be classified as an anomaly
Easily adaptable to online/incremental mode; suitable for anomaly detection from temporal data
Issues:
- Computationally expensive; sledgehammer approach
- Normal points may not form clusters
  - Especially in high-dimensional spaces where data is sparse
FindOut algorithm:
- Remove clusters from the original data to identify outliers
- Transform data into multidimensional signal using wavelet transform:
  - High frequency = rapid distribution change = cluster boundary
  - Low frequency = concentrated data = clusters outliers
- Remove both high and low frequency parts; left with outliers
Cluster-based local outlier factor (CBLOF)
- Squeezer clustering algorithm
- If record lies in a ‘small’ cluster, CBLOF is product of cluster size and distance to closest larger cluster
- If record lies in large cluster, CBLOF is product of cluster size and distance to center of cluster

Statistical Techniques

Model data as stochastic distribution: outliers are points which do not fit the model well
Difficult to estimate distributions in high-dimensions
Parametric:
- Assume normal data generated from an underlying parametric distribution
- Learn the parameters of the distribution from the normal sample
- Determine the likelihood of a test point being generated from that distribution
Non-parametric
- Parametric assumptions often do not hold for real data sets
- Use non-parametric techniques, assuming no knowledge of parameters (e.g. parzen window estimation
- SmartShifter
  - Finite mixtures
  - Assign a score to each data point on by measuring how much the model changes after learning that data point

Other techniques

Information theory:

Idea: outliers significantly alter the information content in a dataset
- Measure of information include entropy and relative entropy
- The measure must be sensitive enough to detect irregularities induced by very few outliers
- Can operate unsupervised
Kolmogorov complexity:
- Determine smallest data subset whose removal leads to the maximal reduction in Kolmogorov complexity
Entropy-based approaches:
- Find the K-sized subset whose removal leads to the maximial reduction in entropy

Spectral techniques:

Eigen decomposition:
- Find a set of attributes that captures the bulk of variability (identifying features); these sets of attributes can explain normal data well, but not the outliers
- Assumes that anomalies and normal instances can be distinguished in the reduced space
- Can work unsupervised
Principal component analysis (PCA):
- Compute principal components of the data sets
- Top components capture the variability in normal data
- Smallest principal components have constant values
- Assumes that outliers have variability in the smallest component
Temporal analysis of dynamic graphs
- PCA:
  - At each time step, compute the principal component
  - Stack principal components over time to form a matrix
  - The left singular vector captures normal behavior
    - Singular vector: opposite of eigen-vectors; the direction of maximum action
  - For any time step, the angle between the principal component and the singular vector gives the degree of anomaly
- Matrix approximation-based methods
  - Approximate matrix using CUR decomposition
    - Set of three matrices whose product is similar to the input matrix
  - Track the approximation error over time; high errors imply outliers

Visualization-based techniques:

Human-in-the-loop
- Humans have a maximum attention span; continuous monitoring not possible
- May be good to use after an alert sounds
Use visualization tools to observe the data and visualize anomalies
Provides views of the data for manual inspection
Does not work well for high-dimensional data

Contextual Anomaly Detection

Identify a context around a data instance using a set of contextual attributes
- Spatial (e.g. coordinates)
- Graph context (e.g. edges/weights)
- Sequential context (e.g. position/time)
- Profile (e.g. user demographics)
Determine if the instance is an anomaly within that context
Challenges: identifying a good set of contextual attributes
- e.g. for time: public holiday, weekend/week day, large events
Point anomaly detection:
- Segment data points into contexts
- Apply normal point outlier techniques within that context
- Use structure in the data
  - Build models from the data using contextual attributes (e.g. ARIMA for time series)
Conditional anomaly detection:
- Gaussian distributions:
  - For environmental/contextual attributes
  - For indicator/behavioral attributes
  - Combine to create an outlier score for each data instance: likelihood of the behavioral attributes being generated given the contextual attributes
Collective anomaly detection:
- Sequential: detect sequences
- Spatial: detect sub-regions
- Graph: detect sub-graphs
Sequential anomaly detection
- Detect anomalous sequences within a database of sequences
- Or an anomalous sub-sequences within a sequence
- Sequence time delay embedding (STIDE)
  - Assumes training data contains normal sequences
  - Training:
    - Extract fixed-length sub-sequences using a sliding window
    - Maintain counts for all sub-sequences observed in the training data
  - Testing
    - Extract the same-length sub-sequences from the test sequence
    - Find the probability of each sub-sequence; declare anomalous if below a threshold
    - Declare sequence anomalous if many sub-sequences are anomalous

Online Anomaly Detection

Continuous ingestion of data
Need to update ‘normal behavior’ profile dynamically
Key idea: update the normal profile with data records that have a low anomaly score
Allows slow shifts in normal profile

Distributed Anomaly Detection

Data coming from multiple different sources
Failures that occur at multiple locations simultaneously may be undetected when each looks at their own data in isolation
Simple data exchange:
- Central location receives, merges and processes all the data
Distributed nearest neighbor approaches:
- Exchange one data record per distance computation - computationally inefficient
- Allows for privacy-preserving anomaly detection models: distances, not raw data, sent
Exchange of models:
- Identify modes of normal behavior and describe them with models
- Share the models between locations and combine them at each location

Intrusion detection:

Mostly signature-based: cannot detect emerging threats
Data mining can be used to detect:
- Signature-resistant attacks
- Stealthy intrusions
- Emerging attacks
- Distributed/coordinated attacks
Data mining approaches:
- Misuse detection:
  - Build predictive models from labeled data sets to identify known intrusions
  - High accuracy for known attacks
- Anomaly detection:
  - Detect deviations from normal behavior
  - High false-alarm rate, but can detect novel attacks

AI Use Cases

Examples:

Fujitsu malware intrusion detection
- Time series log data -> graph structured data -> tensor expressions -> deep learning
- Used to determine relationship between tensor expressions and over time
Password guessing: Passgan: a deep learning approach for password guessing
DeepLocker: evasive malware; triggered by specific events (e.g. facial recognition, location)
AI-Based Cyber Threat Landscape: A Survey:
- Reconnaissance (AI-targeted): learn target’s standard behavior
- Weaponization (AI-aided): adjust attack payload to best match target
- Delivery (AI-concealed)
- Exploit: AI-automated way of finding exploits
- Installation (AI-evolved): self propagation
- Command and control (AI-multilayered): activate destructive payloads, establish multiple paths of attack
- Actions (AI-massive): data extraction, DDoS
ML-based de-anonymization: Blind de-anonymization attacks using social networks
Traffic management: sensors to detect cars, improve traffic management?
- Smart/autonomous cars: possible target for attack?
Predictive maintenance of machinery: predict failures and anomalies using sensor data