Phishing is one of the most common online security threats. A phishing website tries to mimic a legitimate page in order to obtain sensitive data such as usernames, passwords, or financial and health-related information from potential victims.
Machine learning (ML) algorithms have been used to detect phishing websites as a complement to signature matching and heuristics. They usually rely on a set of “domain knowledge” features, such as the number of days the security certificate in the header is valid, the number of domains under the certificate, and the host information. However, many of these domain knowledge features are not available for inline processing, and sophisticated attackers can easily circumvent them.
The following GIF file shows more examples of the images generated by the HTML encoder. Keep in mind that our objective is not to generate realistic images from the HTML content. Instead, it is to learn a suitable HTML representation that will be used to train the classification model for phishing detection.
Classification – phishing or not
Once the HTML encoder has produced a suitable numerical representation (a vector of numbers) of the HTML content of a web page, we combine it with an embedding of the characters in the URL string. The combined values serve as input features to a neural network that performs the final classification. We collected millions of known phishing and benign web pages to train this binary classification model. Because the encoder parameters are not frozen during this training, the HTML encoder is fine-tuned toward phishing classification. The trained classifier then determines whether a new web page is phishing or not.
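The flow described above (URL character embedding plus HTML vector, fed to a neural network) can be illustrated with a small, untrained sketch. Everything in it is an assumption made for illustration: the dimensions, the mean pooling of character embeddings, the single hidden layer, the random weights, and the names `embed_url` and `phishing_score`. The HTML vector is a stand-in for the encoder output, and training and fine-tuning are omitted. It is not the production model.

```python
import math
import random

random.seed(0)  # fixed seed so the toy weights are reproducible

# All sizes below are hypothetical, chosen only to keep the sketch small.
EMBED_DIM = 4    # size of each URL character embedding
HTML_DIM = 6     # size of the vector produced by the HTML encoder
HIDDEN_DIM = 8   # width of the single hidden layer
VOCAB = 128      # ASCII range; other characters share the last slot


def rand_matrix(rows, cols):
    """Return a rows x cols matrix of small random weights."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Untrained parameters; a real model would learn these from labeled pages.
char_table = rand_matrix(VOCAB, EMBED_DIM)
w_hidden = rand_matrix(HIDDEN_DIM, EMBED_DIM + HTML_DIM)
w_out = [random.uniform(-0.5, 0.5) for _ in range(HIDDEN_DIM)]


def embed_url(url):
    """Mean-pool the character embeddings of a URL string (an assumed pooling)."""
    rows = [char_table[min(ord(c), VOCAB - 1)] for c in url]
    if not rows:  # empty URL: fall back to a zero vector
        return [0.0] * EMBED_DIM
    return [sum(col) / len(rows) for col in zip(*rows)]


def phishing_score(url, html_vector):
    """Concatenate URL and HTML features and return a score in (0, 1)."""
    features = embed_url(url) + list(html_vector)
    # Hidden layer with ReLU activation.
    hidden = [max(0.0, sum(w * x for w, x in zip(row, features)))
              for row in w_hidden]
    # Single output unit with a sigmoid, as in binary classification.
    logit = sum(w * h for w, h in zip(w_out, hidden))
    return 1.0 / (1.0 + math.exp(-logit))


# Usage: the HTML vector here is a placeholder for real encoder output.
score = phishing_score("http://example.test/login", [0.1] * HTML_DIM)
```

With trained weights, a score above a chosen threshold would be treated as phishing. The threshold itself is a deployment decision that the text does not specify.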
Netskope Threat Protection
The patented phishing website classifier is now part of Netskope Threat Protection, a comprehensive, multi-layered threat protection system powered by AI and machine learning. It enables us to block phishing web pages in real time because it needs only the page URL string and the HTML content as input, both of which are readily available in the web traffic that goes through the Netskope secure access service edge (SASE) platform. The classifier can detect unknown and zero-day phishing attacks, complementing other heuristic and signature-based engines. It has been optimized to scan web pages inline, with an average runtime of less than 10 milliseconds.
To learn more about the multiple layers of threat capabilities that deliver comprehensive threat protection for cloud and web services, please visit Netskope Threat Protection.
The authors would like to acknowledge the significant contributions from Senior Research Scientist Najmeh Miramirkhani on this project.