When was the last time someone tried to sell you technology that has no machine learning algorithms in it? Are computing systems not based on machine learning any good still? How can you tell the difference?
Why Machine Learning?
Machine learning aims to solve a critical problem: identifying malicious behavior. (It can, of course, also be used by attackers to defeat information security technologies.) Most machine learning engines analyze log files for anomalous patterns to learn from, so they can better protect the network or application in the future. What we are after is the traffic with malicious intent that slips past our known, static security heuristics.
In addition, it is impossible for humans to sift through the massive quantity of data generated by network logs and other sources of data to identify potential attacks. The number of false alarms that waste the limited resources of the security team is enormous.
On the other hand, machine learning algorithms excel at analyzing log data to accurately identify and classify anomalous behavior, or the subtle deviations that indicate a compromise attempt.
Supervised vs. Unsupervised Machine Learning
For detection, two approaches apply: supervised and unsupervised machine learning.
Supervised learning entails providing the algorithm with a "training set" of examples: pairs of input data and the desired, predetermined output or classification.
For attack detection, the training set includes input data for both benign and malicious behaviors, each paired with the correct classification. A rich training set is built through rigorous analysis of communication attributes such as timestamp, duration, path, and periodicity (among dozens of others), as well as the inter-relationships between these attributes.
When a new and unknown data set is introduced, the algorithm determines whether it contains a record of benign or malicious communication, and also provides a confidence level for that identification. Security policies built around the identification and the confidence level can then define the appropriate course of action. Supervised machine learning algorithms are not constrained to recognizing only the patterns found in the training set (or even an updated knowledge set); the underlying algorithms allow them to identify brand-new malicious attacks.
The risk? False negatives: when the algorithm fails to recognize a new pattern, a bad actor is let through.
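The supervised workflow described above can be sketched in a few lines. This is a minimal, hypothetical example: the feature names (duration, bytes sent, hour of day) and the toy data are illustrative assumptions, not taken from any real product or data set.

```python
# Hedged sketch: supervised attack detection with a confidence score.
# Features and values are illustrative assumptions only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy training set of communication records: [duration_s, bytes_out, hour_of_day]
benign = rng.normal([30, 500, 14], [10, 100, 3], size=(100, 3))
malicious = rng.normal([300, 50000, 3], [50, 5000, 1], size=(100, 3))
X = np.vstack([benign, malicious])
y = np.array([0] * 100 + [1] * 100)  # 0 = benign, 1 = malicious

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Score a new, unseen record: get both a classification and a confidence level.
new_record = [[280, 48000, 2]]
label = clf.predict(new_record)[0]
confidence = clf.predict_proba(new_record)[0][label]
```

A security policy could then act on both outputs, for example blocking outright above one confidence threshold and alerting for human review below it.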
In unsupervised learning, there is no training set and no predetermined identification. As it consumes data, the algorithm looks for common patterns so it can create clusters. To keep it simple: normal vs. anomalous behavior, such as communication with an unusual site at a non-standard time of day. The algorithm raises indicators for detected anomalies. These anomalies might relate to malicious or benign communication, and may well require additional effort to thoroughly investigate and characterize.
The risk? False positives: since unsupervised learning lacks guidance, there is a chance it will block real users from accessing the resources they need.
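An unsupervised detector fits this pattern: it sees only unlabeled traffic and flags whatever falls outside the learned clusters. A minimal sketch, assuming hypothetical features (requests per minute, hour of day) and using isolation forests as one common anomaly detection technique:

```python
# Hedged sketch: unsupervised anomaly detection, with no labels involved.
# Feature choices and the contamination rate are illustrative assumptions.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Unlabeled traffic records: [requests_per_min, hour_of_day]
traffic = rng.normal([20, 13], [5, 2], size=(200, 2))
model = IsolationForest(contamination=0.05, random_state=0).fit(traffic)

# A burst of requests at 3 a.m. looks nothing like the learned clusters.
flag = model.predict([[500, 3]])[0]  # -1 = anomaly, 1 = normal
```

Note that the model only says "anomalous"; whether the anomaly is malicious, or a legitimate user working late, still requires investigation.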
If a team has enough security experts and data scientists, a security tool focused on unsupervised machine learning may be the right choice. Unfortunately, most teams aren’t so privileged.
The Best of Both Worlds
Semi-supervised deep learning algorithms cluster behaviors by their commonalities (unsupervised) while also receiving some guidance (supervised). The key to success is exposure to as much data as possible, so the engine can improve its self-tuning feedback loops against as many variations of behavior as possible. In addition, a big data lake makes it possible to take more and more indicators and parameters into account, which is ultimately what makes the classification accurate.
The best example is the detection of sophisticated bots. When a bot can mimic human behavior, generating non-linear mouse movements and keystrokes, it bypasses security controls and is classified as a legitimate user. A supervised algorithm "knows" that such behavior is legitimate and lets the bot in. An unsupervised algorithm adds the behavior to the cluster of common behaviors without knowing whether it is good or bad, so it might later block a real user who behaves the same way; if that happens time and again, the source is eventually blocked. This is called shallow learning.
Some attackers know this, and don't launch their attack until the algorithm has cleared their source. Because some bots distribute their activity over multiple sessions, the key is to store activities and violations in the data lake and correlate them over time for decision making. Had the algorithm had enough data, it might have classified the source differently. For example, source IP, device ID, TPS (transactions per second), and responses to challenges are some of the indicators that help determine whether a source is a bot, so it can be blocked at the very first transaction.
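The cross-session correlation described above amounts to accumulating weak signals per source. A minimal sketch, where the indicator names, the TPS cutoff, and the blocking threshold of 3 are all hypothetical assumptions chosen for illustration:

```python
# Hedged sketch: correlating per-session indicators over time so a bot
# that spreads its activity across sessions is still caught.
# Thresholds and indicator names are illustrative assumptions.
from collections import defaultdict

violations = defaultdict(int)  # running score keyed by (source_ip, device_id)

def record_session(source_ip, device_id, tps, failed_challenge):
    """Accumulate weak signals; any single session may look clean."""
    key = (source_ip, device_id)
    if tps > 50:            # unusually high transactions per second
        violations[key] += 1
    if failed_challenge:    # failed a bot-detection challenge
        violations[key] += 1
    return violations[key] >= 3  # True -> block this source

# Each session alone is borderline; together they cross the threshold.
blocked = False
for _ in range(3):
    blocked = record_session("203.0.113.7", "dev-42", tps=80, failed_challenge=False)
```

A real data lake would persist these scores across days and many more indicators, but the decision logic, correlate first, then block, is the same.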
5 Questions to Ask
When evaluating machine learning solutions, be sure to ask the following questions:
- What does it learn?
- What does it avoid learning?
- How quickly will it adapt to changes?
- Can it self-tune automatically?
- How rich is the data it collects?
Don’t be naïve. Learn more.