Publication:
Prevention Of Data Leakage By Malicious Web Crawlers

dc.contributor.author: Somarathne, H.P.
dc.date.accessioned: 2022-08-24T08:26:17Z
dc.date.available: 2022-08-24T08:26:17Z
dc.date.issued: 2021
dc.description.abstract: Web crawlers are tools used to search for and access information on the internet. Since the internet came into public use, web crawlers have made it easier for search engines to index its content. Unfortunately, web crawlers can be used for nefarious as well as legitimate purposes. With the rising use of search engines and the priority placed on achieving a higher ranking in search indexes, the threats posed by web crawlers have grown significantly. The governing point for web crawlers is the robots exclusion standard, which establishes the set of approved paths that a crawler may follow. Crawlers, however, are able to circumvent these restrictions and retrieve information from restricted web pages. As a result, web crawlers can collect information that can be used for phishing, spamming, and a variety of other unethical and illegal activities, which has a significant impact on service providers. The purpose of this study is to introduce a unique field of research into the detection and prevention of web crawlers. Typical crawler detection methods were found to be ineffective at capturing distributed web crawlers, because each individual crawler produces only a small amount of traffic. Specifically, the research combines improved conventional web crawler prevention methods with a novel crawler detection method in which threshold values are measured. This method adds distributed web crawlers to a restriction list, preventing them from traversing the website. The long tail threshold model (LMT) is presented as the method for measuring these threshold values.
Furthermore, the detection methodology is built on observing crawler traffic and identifying its unique characteristic patterns in order to distinguish it from human-generated traffic. A limitation approach is incorporated into the system to reduce the influence that a crawler can have on a website.
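The threshold-and-restriction-list scheme the abstract describes could be sketched roughly as follows. This is a minimal illustration only, assuming a simple per-client sliding-window request-rate threshold; the names `RATE_THRESHOLD`, `WINDOW_SECONDS`, and `record_request` are hypothetical and do not reflect the thesis's actual LMT implementation:

```python
import time
from collections import defaultdict, deque

# Hypothetical threshold values (the thesis measures these via the
# long tail threshold model; fixed constants are used here for illustration).
RATE_THRESHOLD = 30   # max requests allowed per client within the window
WINDOW_SECONDS = 60   # sliding-window length in seconds

_requests = defaultdict(deque)   # client_ip -> timestamps of recent requests
_restricted = set()              # clients added to the restriction list

def record_request(client_ip, now=None):
    """Record one request; flag the client as a crawler if it exceeds
    the rate threshold. Returns True if the request may be served,
    False if the client is on the restriction list."""
    now = time.time() if now is None else now
    window = _requests[client_ip]
    window.append(now)
    # Drop timestamps that have fallen out of the sliding window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) > RATE_THRESHOLD:
        _restricted.add(client_ip)   # add to the restriction list
    return client_ip not in _restricted
```

A client that stays under the threshold continues to be served, while one that exceeds it is restricted from further traversal of the site, mirroring the limitation approach described above.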
dc.identifier.uri: https://rda.sliit.lk/handle/123456789/2937
dc.language.iso: en
dc.title: Prevention Of Data Leakage By Malicious Web Crawlers
dc.type: Thesis
dspace.entity.type: Publication

Files

Original bundle

Now showing 1 - 2 of 2
Name:
MS20904128_Thesis.pdf
Size:
1.48 MB
Format:
Adobe Portable Document Format
Name:
MS20904128_Thesis_Abs.pdf
Size:
279.75 KB
Format:
Adobe Portable Document Format

License bundle

Now showing 1 - 1 of 1
Name:
license.txt
Size:
1.71 KB
Description:
Item-specific license agreed upon to submission