Keynote Speaker
University of California, Irvine, USA
Keynote Title: The Curious Case of Code Duplication in GitHub
Previous studies have shown that there is a non-trivial amount of
duplication in source code. We recently analyzed a corpus of 2.6 million
non-fork projects hosted on GitHub, representing over 258 million files
written in Java, C++, Python and JavaScript, and found a large amount of
duplication, much more than we anticipated. This finding has made us much
more careful when using open source repositories to draw statistical
conclusions, especially now, in the age of machine learning. In this
talk, I will present our GitHub study and briefly cover some of
our most recent work on extending duplicate detection to the machine
learning models themselves.