1 Math and Computer Science, Fisk University, Nashville, TN 37208.
2 Computer Science, Fisk University, Nashville, TN 37208.
World Journal of Advanced Engineering Technology and Sciences, 2025, 16(03), 373-385
Article DOI: 10.30574/wjaets.2025.16.3.1349
Received on 11 August 2025; revised on 16 September 2025; accepted on 19 September 2025
Web content moderation faces increasingly diverse threats, ranging from phishing and malware to clickbait and fraud. This project, Surf Shelter, proposes a unified risk assessment system that uses multi-label classification to detect multiple types of website threats simultaneously. Drawing on big data from sources such as Common Crawl, OpenPageRank, GitHub, VirusTotal, PhishTank, and Google Safe Browsing, we compile a dataset of websites and threat intelligence. We extract features using natural language processing (DistilBERT embeddings), static code analysis, and security heuristics, and apply an ensemble soft-voting strategy to label websites across threat categories. Preliminary results on 11,500 collected webpages (1,000 of them labeled) show that our model achieves around 84% overall accuracy in identifying malicious content, though calibration is still needed to reduce false positives. An XGBoost classifier was more consistent than the other models, and a Gaussian Mixture Model (GMM) helped adjust decision thresholds when soft-vote scores indicated misclassifications. The evolving cloud-deployed system demonstrates the feasibility of a one-stop, continuously updating platform for web risk detection. We conclude that a multi-label, data-driven approach can significantly enhance web content safety, and we outline future steps to integrate graph neural networks and deploy a user-facing extension for real-time protection.
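The soft-voting step can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the label names, the three sets of per-model probabilities, the equal weights, and the fixed 0.5 thresholds are all hypothetical, and the GMM-based threshold adjustment is not reproduced.

```python
# Hypothetical label set; the abstract lists these threat types only as examples.
LABELS = ["phishing", "malware", "clickbait", "fraud"]


def soft_vote(model_probs, weights=None):
    """Combine per-label probabilities from several models by weighted averaging."""
    if weights is None:
        weights = [1.0] * len(model_probs)  # equal weights are an assumption
    total = sum(weights)
    return {
        label: sum(w * probs[label] for w, probs in zip(weights, model_probs)) / total
        for label in LABELS
    }


def assign_labels(scores, thresholds):
    """Multi-label decision: each label is thresholded independently,
    so a page can receive zero, one, or several threat labels."""
    return [label for label in LABELS if scores[label] >= thresholds[label]]


# Made-up probabilities from three hypothetical base models for one webpage.
model_probs = [
    {"phishing": 0.9, "malware": 0.2, "clickbait": 0.7, "fraud": 0.1},
    {"phishing": 0.8, "malware": 0.4, "clickbait": 0.4, "fraud": 0.2},
    {"phishing": 0.7, "malware": 0.3, "clickbait": 0.6, "fraud": 0.0},
]

# Placeholder thresholds; the paper adjusts these with a GMM, which is not shown here.
thresholds = {label: 0.5 for label in LABELS}

scores = soft_vote(model_probs)
flagged = assign_labels(scores, thresholds)
print(flagged)
```

Because each label has its own threshold, lowering or raising a single entry in `thresholds` changes the decision for that threat type without affecting the others, which is where a calibration step would plug in.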
Multi-label classification; Web threat detection; Big data analytics; Content moderation; Cybersecurity; Ensemble learning
Aditya Karki, Ayesha Imam and Navaraj Pandey. Surf Shelter: A Big Data-driven Risk Assessment System Using Multi-label Classification. World Journal of Advanced Engineering Technology and Sciences, 2025, 16(03), 373-385. Article DOI: https://doi.org/10.30574/wjaets.2025.16.3.1349.