Phishing URL Detection Using RandomForest

Overview

This project aims to detect phishing URLs using a RandomForest classifier. The process involves two main steps: feature extraction from URLs and prediction using a custom trained RandomForest model.

Feature Extraction

When a URL is input into the system, it undergoes a comprehensive feature extraction process. This process analyzes various characteristics of the URL and its associated website to create a feature vector.

The URL features extracted include:

URL having IP: Checks if the URL contains an IP address instead of a domain name.
URL length: Length of the URL string.
URL shortening: Checks if the URL is shortened using services like Bit.ly or TinyURL.
Presence of "@" symbol: Indicates if the URL contains the "@" symbol.
Double slash redirecting: Checks for double slashes in the URL path.
Prefix/suffix domains: Examines the prefix and suffix of the URL for common phishing indicators.
Subdomain and domain features: Extracts features related to subdomains and domain characteristics.
SSL certificate final state: Checks if the SSL certificate of the website is valid.
Domain registration length: Duration since the domain was registered.
Favicon presence: Determines if the website has a favicon.
Port number in URL: Checks for unusual port numbers in the URL.
HTTPS token in URL: Presence of "https" in the URL.
Presence of request URL: Checks for the existence of an external URL request in the webpage.
URL of anchor: Checks for the presence of an anchor element in the URL.
Links in tags: Number of links present within HTML tags.
Suspicious form handling (SFH): Checks if forms are handled securely on the webpage.
Presence of email submission: Indicates if email submission is required.
Abnormal URL redirection: Detects abnormal redirection behavior in the URL.
Mouse over and right-click features: Checks if these features are enabled on the webpage.
Presence of pop-up windows and iframes: Detects the presence of pop-up windows and iframes.
Age of domain: Calculates the age of the domain.
DNS checking: Checks the domain's DNS records.
Web traffic data: Estimates web traffic to the domain.
Google PageRank: Google's ranking of the webpage.
Indexing in Google: Checks if the webpage is indexed by Google.
Links pointing to page: Number of external links pointing to the webpage.
Statistical features: Various statistical features extracted from the URL.

Each of these features is quantified and combined into a single feature vector that represents the URL.

Prediction Using RandomForest

Once the feature vector is created, it's passed to the pre-trained RandomForest classifier:

RandomForest Structure: The classifier consists of multiple decision trees, each trained on a subset of the training data and features.
Decision Process: Each tree in the forest independently classifies the URL based on its feature vector.
Aggregation: The final classification is determined by majority voting among all trees.
Output: The model outputs a binary classification - either "Phishing" or "Legitimate" - along with a confidence score.

RandomForest is particularly effective for this task because:

It can handle the high-dimensional feature space of URL characteristics.
It's robust against overfitting, a common problem in phishing detection due to the constantly evolving nature of attacks.
It can provide feature importance rankings, helping identify which URL characteristics are most indicative of phishing attempts.

By combining thorough feature extraction with the powerful RandomForest algorithm, this project aims to provide accurate and reliable phishing URL detection, helping to enhance online security for users.

Model Performance

We achieved an accuracy of 97.31% in detecting phishing URLs using our RandomForest classifier. This high accuracy demonstrates the effectiveness of our approach in distinguishing between legitimate and phishing websites.

Installation

Clone the repository:

git clone https://github.com/your-username/phishing-url-detection.git
cd phishing-url-detection

Install the required Python packages:
```
pip install -r requirements.txt
```

Usage

Python Script

To run a Python script from your command prompt or terminal, use the following format:
```
python main.py/python3 main.py
```
When prompted, enter the URL you want to check.

The script will output whether the URL is classified as phishing or legitimate.

React Web Application

Navigate to the React app directory:
```
cd frontend
```
Install dependencies:
```
npm install
```
Start the application:
```
npm start
```
Open your web browser and go to http://localhost:3000.
Enter a URL in the input field and click the "Check" or "Predict" button.
The application will display whether the URL is classified as "Phishy URL" or "Legitimate URL".
To stop the React application, press Ctrl + C in the terminal where it's running.

RandomForest Classifier

RandomForest is a robust ensemble learning method that constructs multiple decision trees during training and outputs the mode of the classes (for classification) or mean prediction (for regression) of the individual trees. It excels in handling large datasets with high dimensionality and provides insights into feature importance.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Acknowledgments

Thanks to Akash Kumar for providing the phishing URL dataset.

Future Work

Implement real-time URL checking API
Enhance feature extraction for better accuracy
Develop a browser extension for instant phishing detection

Name		Name	Last commit message	Last commit date
Latest commit History 34 Commits
__pycache__		__pycache__
datasets		datasets
feature_extraction		feature_extraction
frontend		frontend
generate_dataset		generate_dataset
lib		lib
models		models
predict		predict
test_feature_labels		test_feature_labels
trained_models		trained_models
trash		trash
url_txt_files		url_txt_files
LICENSE		LICENSE
README.md		README.md
Report.pdf		Report.pdf
features.txt		features.txt
main.py		main.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Phishing URL Detection Using RandomForest

Overview

Feature Extraction

Prediction Using RandomForest

Model Performance

Installation

Usage

Python Script

React Web Application

RandomForest Classifier

Contributors

Shamanth M Hiremath

Sanchit Vijay

Trijal Shinde

License

Acknowledgments

Future Work

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors 2

Languages

Folders and files

Latest commit

History

Repository files navigation

Phishing URL Detection Using RandomForest

Overview

Feature Extraction

Prediction Using RandomForest

Model Performance

Installation

Usage

Python Script

React Web Application

RandomForest Classifier

Contributors

Shamanth M Hiremath

Sanchit Vijay

Trijal Shinde

License

Acknowledgments

Future Work

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors 2

Languages

Packages