Publication: Prevention Of Data Leakage By Malicious Web Crawlers
Type: Thesis
Date: 2021
Abstract
Web crawlers are tools that automatically traverse the internet to locate and retrieve information. Since the early days of the public internet, they have made it easier for search engines to index online content. Unfortunately, web crawlers can be used for nefarious purposes as well as legitimate ones. Because of the rising use of search engines and the pressure on websites to achieve higher rankings in search indexes, the threats posed by web crawlers have expanded significantly. The robots exclusion standard is the regulating point for web crawlers: it establishes the set of approved paths that a crawler is permitted to take through a site. Crawlers can, however, circumvent these restrictions and retrieve information from restricted web pages.
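For context, the sketch below shows how a compliant crawler consults the robots exclusion standard before fetching a page, using Python's standard urllib.robotparser; the robots.txt content and URLs are made up for illustration, and a malicious crawler simply skips this check and fetches the disallowed paths anyway.

```python
# Minimal sketch of a *compliant* robots exclusion check.
# The robots.txt content and URLs below are hypothetical.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /accounts/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for url in ("https://example.com/index.html",
            "https://example.com/private/report.pdf"):
    allowed = parser.can_fetch("ExampleBot", url)
    print(url, "->", "allowed" if allowed else "disallowed")
```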
As a result, web crawlers can collect information that can be used for phishing, spamming, and a variety of other unethical and illegal activities, which has a significant impact on service providers. The purpose of this study is to introduce a new field of research into the detection and prevention of malicious web crawlers. Typical crawler detection methods were found to be ineffective at capturing distributed web crawlers, because each crawler in such a deployment generates only a small amount of traffic. Specifically, the research
combines improved conventional web crawler prevention methods with a novel crawler detection method based on measured threshold values. Detected distributed web crawlers are added to a restriction list, which prevents them from traversing the website. The long tail threshold model (LMT) is presented as the method for measuring these threshold values.
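To make the idea concrete, the following sketch shows a generic threshold-based restriction list. The per-client request count, the 60-second window, and the threshold value are illustrative assumptions; they do not reproduce the LMT calculation itself.

```python
# Generic threshold-based restriction sketch (not the thesis's LMT formulation).
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60        # observation window (assumed value)
REQUEST_THRESHOLD = 120    # requests allowed per window (assumed value)

request_log = defaultdict(deque)   # client id -> timestamps of recent requests
restriction_list = set()           # clients barred from traversing the site

def observe_request(client_id, now=None):
    """Record one request and return True if the client may still be served."""
    now = time.time() if now is None else now
    if client_id in restriction_list:
        return False
    timestamps = request_log[client_id]
    timestamps.append(now)
    # Discard requests that fell outside the observation window.
    while timestamps and now - timestamps[0] > WINDOW_SECONDS:
        timestamps.popleft()
    if len(timestamps) > REQUEST_THRESHOLD:
        # Exceeded the threshold: treat the client as a crawler and restrict it.
        restriction_list.add(client_id)
        return False
    return True
```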
Furthermore, the detection methodology is built on observing crawler traffic and identifying the characteristic patterns that distinguish it from human-generated traffic. A limitation approach is incorporated into the system to reduce the influence that a crawler can have on a website.
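As an illustration of pattern-based detection combined with a limitation step, the sketch below flags sessions whose request timing is unusually regular and which skip the page assets a browser would load, then throttles them. These particular features and cut-off values are common heuristics chosen for the example, not the thesis's exact characteristic patterns.

```python
# Heuristic sketch: separate crawler-like sessions from human traffic,
# then apply a limitation (throttling) step. Features and cut-offs are
# illustrative assumptions.
from statistics import pstdev

def looks_like_crawler(request_times, asset_requests, page_requests):
    """Return True if a session's pattern resembles automated crawling."""
    if page_requests == 0:
        return False
    intervals = [b - a for a, b in zip(request_times, request_times[1:])]
    # Crawlers tend to request pages at very regular intervals and rarely
    # fetch the images/CSS/JS a browser would load alongside each page.
    regular_timing = len(intervals) >= 5 and pstdev(intervals) < 0.5
    skips_assets = asset_requests / page_requests < 0.1
    return regular_timing and skips_assets

def throttle_delay(is_crawler):
    """Limitation step: slow suspected crawlers instead of serving them at full speed."""
    return 5.0 if is_crawler else 0.0

# Hypothetical session: 10 page requests, exactly 2 seconds apart, no assets fetched.
times = [i * 2.0 for i in range(10)]
crawler = looks_like_crawler(times, asset_requests=0, page_requests=10)
print("crawler-like:", crawler, "| delay:", throttle_delay(crawler), "s")
```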
