About

Q&A

Who is the team?

We are a group of final-year students pursuing the DISM at Singapore Polytechnic. Our team comprises Edison, Isaac, Aldrich, Ryan and Kaizer. Together, we worked with our sponsor, DSO National Laboratories, to address the problem statement presented for our Final Year Project: "Applying large language models to source code bug finding".

What is the team's goal?

The project originally aimed to use OpenAI's GPT-3.5-Turbo to detect bugs in source code. However, after additional research and consideration, the team and our sponsor decided to broaden the scope of the project to Training and Evaluating Open-Source LLMs for Deep Learning-based Vulnerability Detection. This shift was driven by the constraints of closed-source LLMs such as GPT-3.5-Turbo: financial cost, the inability to tweak model parameters, and the lack of support for fine-tuning and transfer learning.

Why did the team choose this project?

A quality dataset is crucial for training any artificial intelligence model, as it can vastly impact the model's overall performance. However, current dataset-curation methods lack an efficient and accurate way of producing quality datasets specifically for vulnerability detection. The team therefore decided to create an automated tool that alleviates the burden of manually curating quality datasets of vulnerable functions for use in vulnerability detection.

How did the team go about the project?

The team proposed and built a Vulnerability-Fixing Commit (VFC) Classifier, which uses Natural Language Processing (NLP) to classify commits as VFCs or non-VFCs. The classifier enables the team to mine open-source repositories on GitHub, classifying their commit messages and extracting the functions associated with each vulnerability.
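
As a minimal sketch of the commit-classification step, assuming a Hugging Face-style fine-tuned encoder (the `vfc-classifier` checkpoint path and the label mapping below are hypothetical placeholders, not the team's released artifact):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# "vfc-classifier" is a hypothetical checkpoint path standing in for an
# encoder fine-tuned to label commit messages as VFC / non-VFC.
tokenizer = AutoTokenizer.from_pretrained("vfc-classifier")
model = AutoModelForSequenceClassification.from_pretrained("vfc-classifier")
model.eval()

def is_vfc(commit_message: str) -> bool:
    """Classify a single commit message as vulnerability-fixing or not."""
    inputs = tokenizer(commit_message, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.argmax(dim=-1).item() == 1  # assumes label 1 means "VFC"

print(is_vfc("Fix heap buffer overflow in png_read_chunk()"))
```

Run over every commit message in a repository's history, this yields the VFC/non-VFC split described above.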

Our VFC Classifier can classify over 600,000 commits and extract 500,000 functions within 2 hours. For reference, this is roughly 4,000 times faster per commit than the curation of the state-of-the-art Devign dataset, which took 600 hours of manual labelling for 50,000 commits and 30,000 functions.

With the dataset curated using our VFC Classifier, we trained our own vulnerability detection model and achieved an accuracy and F1 score of 89%. This is a significant improvement over past models (GPT-2, CodeT5, CodeBERT), whose average F1 score was 54%. The improvement is mainly attributed to the wide range of vulnerabilities that were extracted and classified from different repositories by our VFC Classifier.

What does the team contribute as a whole?
  • The team released a dataset of 3,500 Vulnerability-Fixing Commits (VFCs), curated from 10 distinct C/C++ repositories on GitHub over 40 man-hours of manual labelling. The team enriched this dataset by merging in 4 prominent VFC datasets (BigVul, Devign, CVEfixes, and Linux Kernel CVEs), bringing the total to 36,625 commits. Our compiled dataset is publicly available to support further research.

  • The team incorporated a fine-tuned StarEncoder model into a VFC classification tool, along with function-extraction capabilities (a sketch of the extraction step follows this list). This tool can be used to curate large datasets of vulnerable functions, aiding researchers in Deep Learning-based Vulnerability Detection (VD).
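
A minimal sketch of the function-extraction idea, assuming the enclosing function names are read from the `@@ ... @@` hunk headers that git emits for C/C++ diffs; the repository path and commit hash are placeholders, and the team's actual extraction may differ:

```python
import re
import subprocess

# Matches hunk headers such as "@@ -10,3 +12,4 @@ static int parse_header(...)"
# and captures the name of the enclosing C/C++ function.
HUNK_HEADER = re.compile(r"^@@ .* @@ .*?(\w+)\s*\(")

def changed_functions(repo_path: str, commit: str) -> set[str]:
    """Return the names of functions touched by a commit, from diff hunk headers."""
    diff = subprocess.run(
        ["git", "-C", repo_path, "show", "--unified=0", commit],
        capture_output=True, text=True, check=True,
    ).stdout
    return {m.group(1)
            for line in diff.splitlines()
            if (m := HUNK_HEADER.match(line)) is not None}

# Placeholder usage: changed_functions("/path/to/repo", "abc1234")
```

Pairing this extraction with the commit classifier sketched earlier is what allows vulnerable functions to be harvested at repository scale.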