The ability to predict malicious activity, and to defend against it, is something of a holy grail in the realm of security. There are many technologies that seek to do it, and some of them have made inroads. When it comes to blacklisting domains involved in activities like phishing, malware, and spam, however, most of the blacklists and intel feeds in existence rely on reports or direct observation of malicious activity on the domains. Only after someone has been affected does the malicious infrastructure appear on the feeds. What’s more, if an attack is targeted at an individual organization and that organization doesn’t report the attack to blacklist providers, the rest of the world may never have a chance to learn about infrastructure that could be aimed at others.
No one wants to be a victim, or to see others become one. But how does one go about accurately predicting that a domain will be malicious? Some take the approach of blocking all young domains on the theory that most newly-registered domains are malicious. There is merit to this idea, but it also has some drawbacks: it will yield some false positives, since there are certainly plenty of innocent domains registered every day. Perhaps more concerning from a security standpoint, threat actors know about this approach and may start “seasoning” domains: registering them and leaving them dormant until they’re not so young. Any threat actor who takes the long-term view can do this.
At the same time, anyone who has examined malicious domains as part of a threat hunting or forensic response exercise can often get a good sense of whether a domain is dangerous just by looking at certain characteristics. Does its name make sense, or is it a keyboard-smash of characters? Does the registrant information in the domain’s Whois record look rational? Does the IP address for the domain appear on blacklists because other domains hosted there have proved bad? A domain has many attributes that help paint a picture of its propensity for good or evil.
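To make the "keyboard-smash" intuition concrete, here is a minimal sketch of how a few name-based attributes might be computed. The function names and the particular attributes (length, character entropy, digit ratio, vowel ratio) are illustrative assumptions, not the attributes any specific product uses, and the sketch looks only at the leftmost label of the domain.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Bits of entropy per character; random-looking names tend to score higher."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def name_features(domain: str) -> dict:
    """Hypothetical helper: a few numeric attributes of a domain's name."""
    # Simplification: use only the leftmost label. Real code would need
    # public-suffix handling for names such as "example.co.uk".
    label = domain.lower().split(".")[0]
    letters = [c for c in label if c.isalpha()]
    digits = sum(1 for c in label if c.isdigit())
    vowels = sum(1 for c in letters if c in "aeiou")
    return {
        "length": len(label),
        "entropy": shannon_entropy(label),
        "digit_ratio": digits / len(label) if label else 0.0,
        "vowel_ratio": vowels / len(letters) if letters else 0.0,
    }


if __name__ == "__main__":
    for name in ("google.com", "xk7qz9w2.com"):
        print(name, name_features(name))
```

A name such as "xk7qz9w2" has eight distinct characters, no vowels, and several digits, so it scores high on entropy and digit ratio and zero on vowel ratio. No single number is decisive on its own; plenty of legitimate names look odd, which is why such attributes are combined with others like registrant and hosting data.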
If human analysts could magically apply this kind of scrutiny to every new domain that comes into being, they could create high-fidelity blacklists that block dangerous domains before those domains are ever used in an attack. Magic, however, is in short supply. But here’s the good news: technologies such as machine learning are doing some fairly magical things, and the classification of dangerous domains a priori is one of those things.
In the world of machine learning, things like those attributes of domains (name composition, age, etc.) are called features. A machine learning classifier looks at sets of features in order to determine whether a given entity fits into a particular category. Data scientists “train” machine learning classifiers on various combinations and permutations of features and run the models to see how good the machine is at placing unknown entities into the right classification “buckets.” Some features have more predictive value than others; ultimately, developing a good classifier depends on selecting the best sets of features, and having a data set with a large enough sample size to enable a fine-grained analysis of the entities.
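As a rough illustration of what "training" means here, the sketch below fits a small logistic regression classifier by gradient descent, using only the Python standard library. Everything in it is an assumption made for the example: the two features (name entropy and digit ratio), the function names, the learning rate, and above all the eight training rows, which are invented toy values rather than real domain data. A production classifier would use far more features and a far larger labeled set.

```python
import math

# Invented toy data, for illustration only.
# Each row: [name_entropy_bits, digit_ratio]; label 1 = malicious, 0 = benign.
TRAIN_X = [
    [2.0, 0.0], [2.5, 0.0], [2.2, 0.1], [1.9, 0.0],
    [3.5, 0.4], [3.8, 0.3], [3.2, 0.5], [3.6, 0.2],
]
TRAIN_Y = [0, 0, 0, 0, 1, 1, 1, 1]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit logistic regression with batch gradient descent on centered features."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    Xc = [[row[j] - means[j] for j in range(d)] for row in X]
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(Xc, y):
            # Prediction error for this sample under the current weights.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return {"means": means, "w": w, "b": b}


def predict_proba(model, x) -> float:
    """Estimated probability that a feature vector belongs to the malicious class."""
    z = sum(wj * (xj - m) for wj, xj, m in zip(model["w"], x, model["means"]))
    return sigmoid(z + model["b"])


if __name__ == "__main__":
    model = fit_logistic(TRAIN_X, TRAIN_Y)
    print("weights:", model["w"])
    print("random-looking name:", predict_proba(model, [4.0, 0.5]))
    print("ordinary-looking name:", predict_proba(model, [1.5, 0.0]))
```

The learned weights give a crude view of how much each feature contributes to the decision on this toy data. Evaluating on the training rows, as a quick check might do, says little about real-world accuracy; a held-out set of labeled domains is needed for that.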
Predicting whether a newly-registered domain is likely to be used for phishing, spam, malware, or neutral purposes is the kind of exercise that lends itself well to machine learning, provided that the data scientists have access to a sufficient pool of domains (including a training set of domains known to have been involved with malware, spam, or phishing) and that they can identify enough distinguishing features in a given domain to give the machine classifiers something to work with. Data science, combined with a healthy proportion of the world’s approximately 330 million existing domains, holds the promise of giving beleaguered security teams a useful crystal ball. Teams dealing with network defense or with incident response and forensics could make meaningful progress in their battle against malicious infrastructure if armed with the kinds of blacklists that don’t require someone to get hurt before the danger spots are identified.
Literal crystal balls are the stuff of fantasy, of course. We don’t have, nor will we ever have, perfect vision into the future. But there are areas of the everyday security battle in which predictive technologies can make an important difference. Threat actors never rest; the good news is that science never rests, either.