Publication: Prevention Of Data Leakage By Malicious Web Crawlers
Type: Thesis
Date: 2021
Abstract
Web crawlers are tools that automatically traverse the internet to locate and retrieve information. Since the early days of the public internet, they have made it easier for search engines to index online content. Unfortunately, web crawlers can be used for nefarious purposes as well as legitimate ones. Because of the rising use of search engines and the pressure on websites to achieve higher rankings in search indexes, the threats posed by web crawlers have expanded significantly. The robots exclusion standard is the regulating point for web crawlers: it establishes the set of approved paths that a crawler is permitted to take through a site. Crawlers can, however, circumvent these restrictions and retrieve information from restricted web pages.
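For context, the sketch below shows how a compliant crawler consults the robots exclusion standard before fetching a page, using Python's standard urllib.robotparser; the robots.txt content and URLs are made up for illustration, and a malicious crawler simply skips this check and fetches the disallowed paths anyway.

```python
# Minimal sketch of a *compliant* robots exclusion check.
# The robots.txt content and URLs below are hypothetical.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /accounts/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for url in ("https://example.com/index.html",
            "https://example.com/private/report.pdf"):
    allowed = parser.can_fetch("ExampleBot", url)
    print(url, "->", "allowed" if allowed else "disallowed")
```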
As a result, web crawlers can collect information that can be used for phishing, spamming, and a variety of other unethical and illegal activities, which has a significant impact on service providers. The purpose of this study is to introduce a new field of research into the detection and prevention of malicious web crawlers. Typical crawler detection methods were found to be ineffective at capturing distributed web crawlers, because each crawler in such a deployment generates only a small amount of traffic. Specifically, the research
combines improved conventional web crawler prevention methods with a novel crawler detection method based on measured threshold values. Detected distributed web crawlers are added to a restriction list, which prevents them from traversing the website. The long tail threshold model (LMT) is presented as the method for measuring these threshold values.
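To make the idea concrete, the following sketch shows a generic threshold-based restriction list. The per-client request count, the 60-second window, and the threshold value are illustrative assumptions; they do not reproduce the LMT calculation itself.

```python
# Generic threshold-based restriction sketch (not the thesis's LMT formulation).
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60        # observation window (assumed value)
REQUEST_THRESHOLD = 120    # requests allowed per window (assumed value)

request_log = defaultdict(deque)   # client id -> timestamps of recent requests
restriction_list = set()           # clients barred from traversing the site

def observe_request(client_id, now=None):
    """Record one request and return True if the client may still be served."""
    now = time.time() if now is None else now
    if client_id in restriction_list:
        return False
    timestamps = request_log[client_id]
    timestamps.append(now)
    # Discard requests that fell outside the observation window.
    while timestamps and now - timestamps[0] > WINDOW_SECONDS:
        timestamps.popleft()
    if len(timestamps) > REQUEST_THRESHOLD:
        # Exceeded the threshold: treat the client as a crawler and restrict it.
        restriction_list.add(client_id)
        return False
    return True
```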
Furthermore, the detection methodology is built on observing crawler traffic and identifying the characteristic patterns that distinguish it from human-generated traffic. A limitation approach is incorporated into the system to reduce the influence that a crawler can have on a website.
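As an illustration of pattern-based detection combined with a limitation step, the sketch below flags sessions whose request timing is unusually regular and which skip the page assets a browser would load, then throttles them. These particular features and cut-off values are common heuristics chosen for the example, not the thesis's exact characteristic patterns.

```python
# Heuristic sketch: separate crawler-like sessions from human traffic,
# then apply a limitation (throttling) step. Features and cut-offs are
# illustrative assumptions.
from statistics import pstdev

def looks_like_crawler(request_times, asset_requests, page_requests):
    """Return True if a session's pattern resembles automated crawling."""
    if page_requests == 0:
        return False
    intervals = [b - a for a, b in zip(request_times, request_times[1:])]
    # Crawlers tend to request pages at very regular intervals and rarely
    # fetch the images/CSS/JS a browser would load alongside each page.
    regular_timing = len(intervals) >= 5 and pstdev(intervals) < 0.5
    skips_assets = asset_requests / page_requests < 0.1
    return regular_timing and skips_assets

def throttle_delay(is_crawler):
    """Limitation step: slow suspected crawlers instead of serving them at full speed."""
    return 5.0 if is_crawler else 0.0

# Hypothetical session: 10 page requests, exactly 2 seconds apart, no assets fetched.
times = [i * 2.0 for i in range(10)]
crawler = looks_like_crawler(times, asset_requests=0, page_requests=10)
print("crawler-like:", crawler, "| delay:", throttle_delay(crawler), "s")
```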
