FloCon 2020 has ended
Thursday, January 9 • 9:30am - 10:00am
Code Similarity Detection Using Syntax-Agnostic Locality Sensitive Hashing

Maintaining software security as the volume of new code increases is a pressing "big data" problem. Once a vulnerability is identified in one piece of software, identifying other software that might contain a similar vulnerability is critical. However, this type of search is time-consuming and challenging. In this presentation, we discuss Syntax-Agnostic locality-sensitive hashing (Syntax-Agnostic LSH), an efficient method for finding code with similar functionality in large code repositories. Our approach significantly reduces the amount of time analysts need to identify potentially vulnerable software.

LSH is known to find near-duplicate documents successfully at scale. It has also proven effective in applications such as audio, video, and image search, entity resolution, and fingerprint comparison. Applied to software, LSH makes searching fast: it compresses code segments into hashes and groups similarly hashed segments together, which eliminates the need for exhaustive pairwise comparisons. Because we hash on the semantic meaning of code segments rather than the code itself, our variant of LSH handles varying code writing styles and compilation strategies that can cause code with the same functionality to look syntactically different.
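
The general idea of hashing code segments and bucketing similar hashes can be illustrated with a minimal sketch. This is not the Syntax-Agnostic LSH method from the talk, whose details are not given here. It is generic MinHash with banding, and it assumes a crude stand-in for "semantic meaning": a hypothetical `normalize` function that replaces identifiers and numbers with placeholders. All function names, the keyword list, and the parameters (`k`, `num_hashes`, `bands`) are illustrative assumptions.

```python
import hashlib
import re
from collections import defaultdict
from itertools import combinations

# Assumption: a tiny keyword list, for illustration only.
KEYWORDS = {"def", "for", "in", "if", "else", "return", "while"}

def normalize(source):
    """Hypothetical normalizer: hide identifier names and literal values."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", source)
    out = []
    for tok in tokens:
        if tok in KEYWORDS:
            out.append(tok)
        elif tok[0].isalpha() or tok[0] == "_":
            out.append("ID")
        elif tok.isdigit():
            out.append("NUM")
        else:
            out.append(tok)
    return out

def shingles(tokens, k=3):
    """Set of overlapping k-token windows."""
    return {" ".join(tokens[i:i + k]) for i in range(max(1, len(tokens) - k + 1))}

def _hash(seed, item):
    """Deterministic 64-bit hash of an item under a given seed."""
    data = f"{seed}:{item}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash_signature(items, num_hashes=32):
    """One minimum per seeded hash function; similar sets share many minima."""
    return [min(_hash(seed, it) for it in items) for seed in range(num_hashes)]

def lsh_buckets(signatures, bands=16):
    """Split each signature into bands; equal bands land in the same bucket."""
    buckets = defaultdict(set)
    for name, sig in signatures.items():
        rows = len(sig) // bands
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(name)
    return buckets

def candidate_pairs(buckets):
    """Only segments sharing a bucket are compared, not every pair."""
    pairs = set()
    for names in buckets.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs

# Made-up example segments: two differ only in naming, one is unrelated.
SNIPPETS = {
    "sum_v1": "total = 0\nfor x in items: total = total + x",
    "sum_v2": "acc = 0\nfor v in data: acc = acc + v",
    "clamp": "return max(0, min(n, 255))",
}

signatures = {
    name: minhash_signature(shingles(normalize(src)))
    for name, src in SNIPPETS.items()
}
pairs = candidate_pairs(lsh_buckets(signatures))
print(pairs)
```

In this sketch the two summation loops normalize to the same token stream, so they share every band and surface as a candidate pair, while the unrelated segment shares no shingles with them. A real system would need a far richer representation of functionality than placeholder substitution, particularly for compiled code.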

The use of Syntax-Agnostic LSH as a code similarity detection and searching capability reduces the time, effort, and cost of debugging and maintaining software and allows us to be one step ahead of attackers. Our approach is both an investigative and preventative tool. It allows for much faster identification of code with both technical and logical vulnerabilities that need to be fixed, and it encourages the reuse of “repaired” code through its ability to search for code segments by functionality, rather than syntax. Our cyber team has incorporated Syntax-Agnostic LSH into its investigative platform, with the expectation that it will decrease the length of investigations from 3-4 weeks to under a week.

Attendees Will Learn:
How to better maintain the security of large codebases through investigative and preventative means.

Lara Dedic

Machine Learning Researcher, Novetta
Lara Dedic is an Applied Machine Learning Researcher at Novetta, an advanced analytics company headquartered in McLean, VA. Lara focuses on applying machine learning methods from natural language processing (NLP), computer vision, and other domains to cybersecurity.

Thursday January 9, 2020 9:30am - 10:00am EST
Regency Ballroom Hyatt Regency Savannah 2 W. Bay Street Savannah GA 31401