Publication:
Prevention Of Data Leakage By Malicious Web Crawlers

dc.contributor.author: Somarathne, H.P.
dc.date.accessioned: 2022-08-24T08:26:17Z
dc.date.available: 2022-08-24T08:26:17Z
dc.date.issued: 2021
dc.description.abstract: Web crawlers are tools used to search for and access information on the internet. Since the internet came into public use, web crawlers have made it easier for search engines to index its content. Unfortunately, web crawlers can be used for nefarious as well as legitimate purposes. With the rising use of search engines and the priority placed on achieving a higher ranking in search indexes, the threats posed by web crawlers have grown significantly. The governing point for web crawlers is the robots exclusion standard, which establishes the set of approved paths that a crawler may follow. Crawlers, however, are able to circumvent these restrictions and retrieve information from restricted web pages. As a result, web crawlers can collect information that can be used for phishing, spamming, and a variety of other unethical and illegal activities, which has a significant impact on service providers. The purpose of this study is to introduce a unique field of research into the detection and prevention of web crawlers. Typical crawler detection methods were found to be ineffective at capturing distributed web crawlers, because each individual crawler produces only a small amount of traffic. Specifically, the research combines improved conventional web crawler prevention methods with a novel crawler detection method in which threshold values are measured. This method adds distributed web crawlers to a restriction list, preventing them from traversing the website. The long tail threshold model (LMT) is presented as the method for measuring these threshold values.
Furthermore, the detection methodology is built on observing crawler traffic and identifying its unique characteristic patterns in order to distinguish it from human-generated traffic. A limitation approach is incorporated into the system to reduce the influence that a crawler can have on a website.
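The threshold-and-restriction-list scheme the abstract describes could be sketched roughly as follows. This is a minimal illustration only, assuming a simple per-client sliding-window request-rate threshold; the names `RATE_THRESHOLD`, `WINDOW_SECONDS`, and `record_request` are hypothetical and do not reflect the thesis's actual LMT implementation:

```python
import time
from collections import defaultdict, deque

# Hypothetical threshold values (the thesis measures these via the
# long tail threshold model; fixed constants are used here for illustration).
RATE_THRESHOLD = 30   # max requests allowed per client within the window
WINDOW_SECONDS = 60   # sliding-window length in seconds

_requests = defaultdict(deque)   # client_ip -> timestamps of recent requests
_restricted = set()              # clients added to the restriction list

def record_request(client_ip, now=None):
    """Record one request; flag the client as a crawler if it exceeds
    the rate threshold. Returns True if the request may be served,
    False if the client is on the restriction list."""
    now = time.time() if now is None else now
    window = _requests[client_ip]
    window.append(now)
    # Drop timestamps that have fallen out of the sliding window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) > RATE_THRESHOLD:
        _restricted.add(client_ip)   # add to the restriction list
    return client_ip not in _restricted
```

A client that stays under the threshold continues to be served, while one that exceeds it is restricted from further traversal of the site, mirroring the limitation approach described above.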
dc.identifier.uri: https://rda.sliit.lk/handle/123456789/2937
dc.language.iso: en
dc.title: Prevention Of Data Leakage By Malicious Web Crawlers
dc.type: Thesis
dspace.entity.type: Publication

Files

Original bundle

Now showing 1 - 2 of 2
Name:
MS20904128_Thesis.pdf
Size:
1.48 MB
Format:
Adobe Portable Document Format
Name:
MS20904128_Thesis_Abs.pdf
Size:
279.75 KB
Format:
Adobe Portable Document Format

License bundle

Now showing 1 - 1 of 1
Name:
license.txt
Size:
1.71 KB
Description:
Item-specific license agreed upon to submission