Background
Types of Anomaly Detection
- Contextual
- Collective
- Online
- Distributed
- Point
- Classification
- Rule
- Neural networks
- SVM
- Nearest Neighbor
- Density
- Distance
- Clustering
- Statistical
- Parametric
- Non-parametric
- Other
- Information theory
- Spectral decomposition
- Visualization
- Classification
Neural Networks
- 1990s:
- Neural networks with very few (e.g. 3) layers
- Too limited/sensitive, required lots of manual tweaking
- Linear classifier: cannot learn XOR function
- Deep learning: hundreds of layers
- Accuracy high enough for commercial use
- Convolutions neural networks
- Between layers, convolutions applied and sub-sampling occurs
- ML vs DL:
- ML: feature extraction done by humans
- DL: relevant features automatically extracted from input
- CNNs: repeat (convolution, sub-sampling) steps for a few layers, then form full connections and Gaussian connections before outputting results.
Introduction
Too much data, not enough analysis.
Anomalous events occur relatively infrequently, but when they do occur, their consequences can be devastating.
What are anomalies?
- Patterns in the data that do not conform to the expected behavior
- Patterns/outliers/exceptions/peculiarities
- Often translates to real-life entities
- Credit card fraud
- Cyber intrusions
Related problems:
- Rare class mining
- Chance discovery
- Novelty detection
- Exception mining
- Noise removal
- Black swan
- Catastrophic events that are so rare that even the possibility that it might occur is unknown
- Often obvious in hindsight
Challenges:
- Defining a representative ‘normal’ region
- The boundary between normal and outlying behavior is often not precise
- The notion of an outlier varies between domains
- What is normal can change over time
- Requires labeled data for training/validation
- Malicious adversaries
- Noisy data
Aspects:
- Nature of input data
- Univariate/multivariate
- Attribute nature:
- Binary (e.g. attack or not)
- Categorical (e.g. IP range)
- Continuous (e.g. duration)
- Hybrid
- Relationships:
- Sequential/temporal
- Spatial
- Spatio-temporal
- Graph
- Availability of human supervision:
- Supervised:
- Labels available for both normal data and anomalies
- Normal/anomaly label
- Similar to rare class mining
- Semi-supervised anomaly:
- Labels only available for normal
- Unsupervised anomaly detection:
- No labels; based on the assumption that anomalies are both very rare and very obvious
- Shifting domains:
- What was normal a year ago may no longer be true
- How do you ensure your classifier stays up to date
- Supervised:
- Anomaly type: point, contextual, structural
- Point:
- Is an individual data point anomalous?
- Contextual:
- Is an individual data instance anomalous within a context
- e.g. daily temperature: what is normal in summer may be an anomaly in winter
- e.g. maybe there is a daily spike in traffic every day at 10 pm
- Collective:
- A collection of related instances is anomalous
- e.g. temporal data, data points stay constant when there is usually significant variance
- A collection of related instances is anomalous
- Point:
- Output of anomaly detection
- Label:
- Is the test instance normal or an anomaly?
- Score:
- How anomalous is the test instance
- Allows ranking and thresholding
- How anomalous is the test instance
- Label:
- Evaluation of anomaly detection techniques
- Accuracy alone is not a sufficient metric
- If events are rare, outputting all events as normal gives you very high accuracy
- Like having a security guard: security breaches do not happen often, but it is critical that they respond properly when a breach occurs
- Confusion matrix: false/true positives/negatives (FP, FN, TP, TN)
- Recall:
R = TP/(TP + FN)- The proportion of anomalies correctly detected as anomalies
- Precision:
P = TP/(TP + FP)- The proportion of detected anomalies that were real
- F-measure:
- A single measure combining recall and precision
2RP(R+P)
- False alarm rate:
- Proportion of normal instances classified as anomalies
FP/(FP + TN)
- ROC curve:
- Trade-off between detection rate and false alarm
- As detection rate increases, number of false alarms also increases
- Area under the ROC curve (AUC): compute using trapezoid rule
- Ideal ROC curve has area of 1: 100% detection rate with zero false alarms
- Accuracy alone is not a sufficient metric
Applications:
- Network intrusion detection
- Intrusions: attempts to bypass security mechanism
- Challenges:
- Traditionally signature-based: systems detect signatures of known attacks and hence cannot detect emerging threats
- Fraud detection
- Credit card, insurance, mobile/cell fraud
- Fraud could be an actual customer or be through identity theft
- High mis-classification cost
- Insider trading
- Credit card, insurance, mobile/cell fraud
- Healthcare informatics/medical diagnostics
- Detect anomalous patient records: disease outbreaks, instrumentation errors, hacked records
- Challenges:
- Only normal labels available
- Equipment, processes, rules etc. varies between countries: normal differs
- High mis-classification cost
- Complex, spatio-temporal relationships
- Industrial damage detection
- Detect faults/failures in systems, structural/mechanical damage etc.
- Very large, noisy and unlabeled data set
- Typically requires immediate intervention
- Image processing/video surveillance
- Detect outliers in images over time
- e.g. unattended bag in airport
- Detect anomalous regions within an image
- e.g. mammography image analysis
- Detect outliers in images over time
- Novel topic detection in text mining
Point Anomaly Detection
Classification-based Techniques
Main idea: build a classification model for normal/anomalous events based on labeled training data.
Models must be able to handle skewed/imbalanced class distributions: anomalous events are rare by definition.
Must be able to handle skewed/imbalanced class distributions: anomalous events are rare.
Categories:
- Supervised classification
- Both normal and anomaly classes labeled
- Subscribe to many sources to get examples of new anomalies as they appear in the world; hard to get good data set of anomalies within a single organization
- Build classifier to distinguish between normal and anomalous behavior
- High accuracy
- Cannot detect unknown and emerging anomalies
- Both normal and anomaly classes labeled
- Semi-supervised classification
- Requires only labels from the normal class
- Detect deviations from normal behavior
- Possible high false alarm rate - previously unseen but legitimate data records may be incorrectly categorized
Techniques:
- Manipulating data records
- Oversampling the rare class
- Make duplicates of the rare class to balance out the two classes
- Does not increase information but increases the mis-classification cost
- Undersampling the majority class
- Techniques: random, near-miss examples, examples far from decision boundaries
- Results in general information loss and overly general rules
- Generating artificial anomalies
- Synthetic minority over-sampling techniques: generate rare class examples inside regions or near the edges of existing rare class examples
- Oversampling the rare class
- Rule-based
- Adapting existing rule-based techniques:
- Adapting multi-class classification methods to a single-class classification
- Association rules
- Rules with support higher than some pre-specified threshold may characterize normal behavior
- Case specific feature weighting
- Basically any classifier will require you to set some sort of threshold
- Decision tree learning: each rare class test examples replaces the global weight vector with a dynamically generated weight vector that depends on the path taken by that example
- Case specific rule weighting:
- LERS: learning from examples based on rough sets; increases rule strength for rules describing the rare class
- PN-rule learning
- P-phase:
- Create rules that cover most positive examples with high support
- Phase focuses on gaining high recall
- N-phase
- Add rules which remove the false positives
- Phase focuses on high accuracy and support
- P-phase:
- CREDOS
- Graph theory approach: create decision tree
- Ripple down rules: overfit the training data
- Create binary tree with a rule on each node
- Overfit the training data
- Then prune the decision tree to generalize
- Adapting existing rule-based techniques:
- Model-based:
- Deep learning/Neural networks
- Multi-layer perceptrons
- Measure activation of output nodes
- Extend learning beyond decision boundaries
- Auto-associative neural networks
- Replicator neural network
- Multi-layer, feed-forward neural network
- Same number of input and output nodes; variables should match between input and output
- Hidden layers have fewer nodes, forcing the RNN to come up with a compressed representation of the data
- Measure outlyingness by checking the reconstruction error of each data point:
- Model is designed to compress normal data, so anomalies will not be compressed as well
- Hopfield networks
- Replicator neural network
- Radial basis functions
- Can interpolate between sparse data points
- Adds reverse connection from output to central layer
- Each neuron has an associated normal distribution; any new instance that does not fit any of the distributions is an anomaly
- Multi-layer perceptrons
- Support vector machines
- Normal data records belong to high-density data regions
- Use unsupervised approach to learn high/low density data regions
- Use SVM to classify the data density level and hence detect anomalies
- Tensor calculus
- Transform data points into a different (higher-dimensional) space to maximize the distance between the two categories
- Linear; uses hyperplanes for classifications
- RBF: radial basis function kernel
- Computes similarity between two points
-
-
when distance is 0 -
is the variance and SVM hyperparameter
-
- As variance decreases, number of support vectors increases, making the separating surface more complex
- Bayesian networks
- Typical: aggregate information from different variables to estimate the probability that events belongs to the normal/anomalous class
- Naive Bayesian classifiers
- Incorporates prior probabilities to classify events as normal/anomalous
- Psuedo-Bayes estimators
- I stage: learn prior, posterior of unseen anomalies from training data
- II stage: use naive Bayesian classifiers to classify instances into normal instances, known anomalies and new anomalies
- Hidden Markov model
- Deep learning/Neural networks
- Cost-sensitive classification
- Thresholding based on possible cost to the organization
- Ensemble:
- Boosting: learning algorithm that can increase the performance of a weak classifier
Nearest Neighbor
- Approach:
- Compute neighborhood for each data record
- Analyze neighborhood to determine if the record is an anomaly
- Nice and simple, and can be used in unsupervised or semi-supervised settings
- May fail if normal points do not have a sufficient number of neighbors
- Computationally expensive
- As dimensionality increases, data becomes sparse and similarity/distance may not be meaningful
- Assumes normal points are ‘close’ to neighbors
- Distance:
- Point is an outlier if more than
(proportion) of the nearest neighbours are more than distance away from the point - Not suitable for datasets that have modes with varying density
- Point is an outlier if more than
- Density:
- Compute local densities in particular regions; declare instances in low-density regions as anomalies
- Local outlier factor (LOF)
- For each data point
: - Compute distance to the
-th nearest neighbor - Compute reachability distance with respect to data example
: - Compute local reachability density: inverse of average reachability based on
nearest neighbors of : - Compute local outlier factor as ratio of average local reachability density of
’s nearest neighbors and the local reachability density of :
- For each data point
- Connectivity outlier factor (COF)
- Points are connected to their closest neighbor(s?), and are connected to others by the connctivity of the points they are associated with; these recursive relationships form chains
- Outliers have a greater average distance between points in a chain than the average chaining distance of their nearest neighbours
- Multi-granularity deviation factor (LOCI)
- Neighbourhood size computed; outliers are points whose neighborhood size are significantly different with respect to their neighbours
- Finds both outlying points and outlying micro-clusters
- Distance:
Clustering-based Techniques
- Normal points belong to large and dense clusters; anomalies do not belong to any, or form very small clusters
- Semi-supervised
- Cluster normal data to create modes of normal behavior; anomalies are instances that do not belong to any clusters
- Unsupervised
- Post-processing required to determine cluster size and max distance for a point to be classified as an anomaly
- Easily adaptable to online/incremental mode; suitable for anomaly detection from temporal data
- Issues:
- Computationally expensive; sledgehammer approach
- Normal points may not form clusters
- Especially in high-dimensional spaces where data is sparse
- FindOut algorithm:
- Remove clusters from the original data to identify outliers
- Transform data into multidimensional signal using wavelet transform:
- High frequency = rapid distribution change = cluster boundary
- Low frequency = concentrated data = clusters outliers
- Remove both high and low frequency parts; left with outliers
- Cluster-based local outlier factor (CBLOF)
- Squeezer clustering algorithm
- If record lies in a ‘small’ cluster, CBLOF is product of cluster size and distance to closest larger cluster
- If record lies in large cluster, CBLOF is product of cluster size and distance to center of cluster
Statistical Techniques
- Model data as stochastic distribution: outliers are points which do not fit the model well
- Difficult to estimate distributions in high-dimensions
- Parametric:
- Assume normal data generated from an underlying parametric distribution
- Learn the parameters of the distribution from the normal sample
- Determine the likelihood of a test point being generated from that distribution
- Non-parametric
- Parametric assumptions often do not hold for real data sets
- Use non-parametric techniques, assuming no knowledge of parameters (e.g. parzen window estimation
- SmartShifter
- Finite mixtures
- Assign a score to each data point on by measuring how much the model changes after learning that data point
Other techniques
Information theory:
- Idea: outliers significantly alter the information content in a dataset
- Measure of information include entropy and relative entropy
- The measure must be sensitive enough to detect irregularities induced by very few outliers
- Can operate unsupervised
- Kolmogorov complexity:
- Determine smallest data subset whose removal leads to the maximal reduction in Kolmogorov complexity
- Entropy-based approaches:
- Find the K-sized subset whose removal leads to the maximial reduction in entropy
Spectral techniques:
- Eigen decomposition:
- Find a set of attributes that captures the bulk of variability (identifying features); these sets of attributes can explain normal data well, but not the outliers
- Assumes that anomalies and normal instances can be distinguished in the reduced space
- Can work unsupervised
- Principal component analysis (PCA):
- Compute principal components of the data sets
- Top components capture the variability in normal data
- Smallest principal components have constant values
- Assumes that outliers have variability in the smallest component
- Temporal analysis of dynamic graphs
- PCA:
- At each time step, compute the principal component
- Stack principal components over time to form a matrix
- The left singular vector captures normal behavior
- Singular vector: opposite of eigen-vectors; the direction of maximum action
- For any time step, the angle between the principal component and the singular vector gives the degree of anomaly
- Matrix approximation-based methods
- Approximate matrix using CUR decomposition
- Set of three matrices whose product is similar to the input matrix
- Track the approximation error over time; high errors imply outliers
- Approximate matrix using CUR decomposition
- PCA:
Visualization-based techniques:
- Human-in-the-loop
- Humans have a maximum attention span; continuous monitoring not possible
- May be good to use after an alert sounds
- Use visualization tools to observe the data and visualize anomalies
- Provides views of the data for manual inspection
- Does not work well for high-dimensional data
Contextual Anomaly Detection
- Identify a context around a data instance using a set of contextual attributes
- Spatial (e.g. coordinates)
- Graph context (e.g. edges/weights)
- Sequential context (e.g. position/time)
- Profile (e.g. user demographics)
- Determine if the instance is an anomaly within that context
- Challenges: identifying a good set of contextual attributes
- e.g. for time: public holiday, weekend/week day, large events
- Point anomaly detection:
- Segment data points into contexts
- Apply normal point outlier techniques within that context
- Use structure in the data
- Build models from the data using contextual attributes (e.g. ARIMA for time series)
- Conditional anomaly detection:
- Gaussian distributions:
- For environmental/contextual attributes
- For indicator/behavioral attributes
- Combine to create an outlier score for each data instance: likelihood of the behavioral attributes being generated given the contextual attributes
- Gaussian distributions:
- Collective anomaly detection:
- Sequential: detect sequences
- Spatial: detect sub-regions
- Graph: detect sub-graphs
- Sequential anomaly detection
- Detect anomalous sequences within a database of sequences
- Or an anomalous sub-sequences within a sequence
- Sequence time delay embedding (STIDE)
- Assumes training data contains normal sequences
- Training:
- Extract fixed-length sub-sequences using a sliding window
- Maintain counts for all sub-sequences observed in the training data
- Testing
- Extract the same-length sub-sequences from the test sequence
- Find the probability of each sub-sequence; declare anomalous if below a threshold
- Declare sequence anomalous if many sub-sequences are anomalous
Online Anomaly Detection
- Continuous ingestion of data
- Need to update ‘normal behavior’ profile dynamically
- Key idea: update the normal profile with data records that have a low anomaly score
- Allows slow shifts in normal profile
Distributed Anomaly Detection
- Data coming from multiple different sources
- Failures that occur at multiple locations simultaneously may be undetected when each looks at their own data in isolation
- Simple data exchange:
- Central location receives, merges and processes all the data
- Distributed nearest neighbor approaches:
- Exchange one data record per distance computation - computationally inefficient
- Allows for privacy-preserving anomaly detection models: distances, not raw data, sent
- Exchange of models:
- Identify modes of normal behavior and describe them with models
- Share the models between locations and combine them at each location
Intrusion detection:
- Mostly signature-based: cannot detect emerging threats
- Data mining can be used to detect:
- Signature-resistant attacks
- Stealthy intrusions
- Emerging attacks
- Distributed/coordinated attacks
- Data mining approaches:
- Misuse detection:
- Build predictive models from labeled data sets to identify known intrusions
- High accuracy for known attacks
- Anomaly detection:
- Detect deviations from normal behavior
- High false-alarm rate, but can detect novel attacks
- Misuse detection:
AI Use Cases
Examples:
- Fujitsu malware intrusion detection
- Time series log data -> graph structured data -> tensor expressions -> deep learning
- Used to determine relationship between tensor expressions and over time
- Password guessing: Passgan: a deep learning approach for password guessing
- DeepLocker: evasive malware; triggered by specific events (e.g. facial recognition, location)
- AI-Based Cyber Threat Landscape: A Survey:
- Reconnaissance (AI-targeted): learn target’s standard behavior
- Weaponization (AI-aided): adjust attack payload to best match target
- Delivery (AI-concealed)
- Exploit: AI-automated way of finding exploits
- Installation (AI-evolved): self propagation
- Command and control (AI-multilayered): activate destructive payloads, establish multiple paths of attack
- Actions (AI-massive): data extraction, DDoS
- ML-based de-anonymization: Blind de-anonymization attacks using social networks
- Traffic management: sensors to detect cars, improve traffic management?
- Smart/autonomous cars: possible target for attack?
- Predictive maintenance of machinery: predict failures and anomalies using sensor data