SecretHunter: A Large-scale Secret Scanner for Public Git Repositories
Author
Abstract

Metadata Discovery Problem - Collaborative software development platforms like GitHub have gained tremendous popularity. Unfortunately, many users have reportedly leaked authentication secrets (e.g., textual passwords and API keys) in public Git repositories and caused security incidents and finical loss. Recently, several tools were built to investigate the secret leakage in GitHub. However, these tools could only discover and scan a limited portion of files in GitHub due to platform API restrictions and bandwidth limitations. In this paper, we present SecretHunter, a real-time large-scale comprehensive secret scanner for GitHub. SecretHunter resolves the file discovery and retrieval difficulty via two major improvements to the Git cloning process. Firstly, our system will retrieve file metadata from repositories before cloning file contents. The early metadata access can help identify newly committed files and enable many bandwidth optimizations such as filename filtering and object deduplication. Secondly, SecretHunter adopts a reinforcement learning model to analyze file contents being downloaded and infer whether the file is sensitive. If not, the download process can be aborted to conserve bandwidth. We conduct a one-month empirical study to evaluate SecretHunter. Our results show that SecretHunter discovers 57\% more leaked secrets than state-of-the-art tools. SecretHunter also reduces 85\% bandwidth consumption in the object retrieval process and can be used in low-bandwidth settings (e.g., 4G connections).

Year of Publication
2022
Date Published
dec
Publisher
IEEE
Conference Location
Wuhan, China
ISBN Number
978-1-66549-425-0
URL
https://ieeexplore.ieee.org/document/10063545/
DOI
10.1109/TrustCom56396.2022.00028
Google Scholar | BibTeX | DOI