Keynote Speaker

Chancellor's Professor Cristina Videira Lopes

University of California, Irvine, USA

Keynote Title: The Curious Case of Code Duplication in GitHub


Previous studies have shown that there is a non-trivial amount of duplication in source code. We recently analyzed a corpus of 2.6 million non-fork projects hosted on GitHub, representing over 258 million files written in Java, C++, Python, and JavaScript, and found far more duplication than we anticipated. This finding has made us much more careful about using open source repositories to draw statistical conclusions, especially now, in the age of machine learning. In this talk, I will present our GitHub study and briefly cover some of our most recent work on extending duplicate detection to the machine learning models themselves.