More than 70 percent of the code stored on GitHub is duplicated, a study has found.

An international team of eight researchers working for the University of California was surprised when it began studying how much files changed between different clones. What the team discovered was a "staggering rate of file-level duplication".

Presented at this year's SPLASH conference in Vancouver, the research found that out of 428 million files on GitHub, only 85 million are unique.

The paper looked at 4.5 million non-fork projects hosted on GitHub, representing more than 428 million files written in Java, C++, Python, and JavaScript. It found that this corpus contains a mere 85 million unique files.

In other words, 70 percent of the code on GitHub consists of clones of previously created files. There is considerable variation between language ecosystems. JavaScript has the highest rate of file duplication: only six percent of its files are distinct. Java, on the other hand, has the least duplication, with 60 percent of its files distinct.

A project-level analysis shows that in between nine percent and 31 percent of projects, at least 80 percent of the files can be found elsewhere.
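To make the two measurements concrete, here is a minimal sketch of how file-level and project-level duplication could be computed. It assumes exact, byte-for-byte matching via SHA-256 hashes, which is an illustrative choice and not necessarily the method the researchers used. The names `file_hash` and `duplication_stats` and the input layout (a mapping of project name to a list of file contents) are hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Tuple


def file_hash(content: bytes) -> str:
    """Fingerprint a file by its exact bytes (assumption: exact-match clones only)."""
    return hashlib.sha256(content).hexdigest()


def duplication_stats(
    projects: Dict[str, List[bytes]], threshold: float = 0.8
) -> Tuple[float, List[str]]:
    """Return (share of distinct files, projects mostly made of files found elsewhere).

    `projects` maps a hypothetical project name to the contents of its files.
    A project is reported when at least `threshold` of its files also
    appear in some other project.
    """
    # Record which projects contain each distinct file fingerprint.
    owners = defaultdict(set)
    total_files = 0
    for name, files in projects.items():
        for content in files:
            owners[file_hash(content)].add(name)
            total_files += 1

    # File-level measure: distinct fingerprints divided by all files.
    distinct_share = len(owners) / total_files if total_files else 0.0

    # Project-level measure: share of a project's files seen in another project.
    heavy = []
    for name, files in projects.items():
        if not files:
            continue
        shared = sum(1 for c in files if len(owners[file_hash(c)]) > 1)
        if shared / len(files) >= threshold:
            heavy.append(name)

    return distinct_share, sorted(heavy)
```

With three toy projects, `{"a": [b"x", b"y"], "b": [b"x", b"y", b"z"], "c": [b"w"]}`, four of the six files are distinct, and only project `a` crosses the 80 percent threshold, since both of its files also appear in `b`.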

These rates of duplication have implications for systems built on open source software as well as for researchers interested in analysing large code bases.