Cameron Jones
Graduate School of Library and Information Science
University of Illinois

Copy-Paste Programming: An exploratory analysis of software clones in Open-Source

Introduction

The Free and Open Source movements have generated large collections of source code in repositories like SourceForge.net, the Comprehensive Perl Archive Network (CPAN), the Mozilla Foundation, CSourceSearch.net, and the GNU Project. These source code repositories present many opportunities for compelling new research.

Clone analysis has emerged in software engineering as a means of detecting duplicated regions of source code. Typically, these techniques are applied within the context of a single software system or application as a way of measuring the quality of the software. The underlying premises are that duplicated code increases the size and complexity of the software, propagates bugs and errors, and complicates later maintenance of the code. Given that 10-15% of the source code in large projects is duplicated code, tremendous attention has been paid to developing methods for identifying and eliminating software clones.

Clone analysis, however, has not been widely applied to finding duplicated code between applications, or more generally to looking outside the scope of a single application. A small number of studies have performed clone analyses across projects, but their approaches are limited to counting the number of clones identified between projects. None of these studies has attempted to map out a larger picture of coupling and similarity between projects with clone detection at its heart.

Other research has taken a more qualitative approach to understanding the role of copying and duplicating code in programming. Rosson and Carroll (1993) report on the activities of a small number of programmers and their reuse of example code when incorporating a new library or module. Kim et al.
(2004) found that programmers copied code at an average rate of 16 times per hour, which supports the commonly held belief that programmers copy and reuse existing code. These studies provide useful qualitative descriptions of the processes by which programmers incorporate copied code. However, the Rosson and Carroll study identifies only a limited set of sources which were included in the source code itself, and Kim et al. do not address the sources of copied code at all.

There is ample opportunity to explore this larger open-source sphere in order to understand, more broadly, the mobility of source code within this space. The phrase "programming by Google" reflects the attitudes of many hobbyist and amateur programmers, but how pervasive is this behavior and what does it tell us about programming practice in general? Without sufficient historical archives or logs, we cannot directly address questions of the "origin" of clones and copied code. However, we can identify patterns of clone-sharing among projects as well as between projects and external sites. This research seeks to address the following research questions:

1. What are the patterns of clone duplication among open-source projects, and between open-source projects and external resources like websites, project documentation, mailing lists, and discussion forums? Do the patterns resemble patterns of citations in citation networks?
2. Are there certain clones or classes of clones which are cloned more often than others?
3. Are there certain software projects or classes of projects which are involved in cloning more often than others?
4. Are there certain external resources, or classes of resources, which are involved in cloning more often than others?
5. Can clones be used as a similarity measure for use in clustering analysis? Do the clusters resemble any meaningful groupings?

Data & Methods

Source code will be collected from the SourceForge.net open source repository.
For this study, only projects written in the PHP programming language will be used. Furthermore, a set of external sites will be crawled and any code samples of sufficient length will be downloaded. The exact set of sites will be compiled through a small pilot study in which the researcher will log copy operations from websites during programming activities. The computational complexity of the clone analysis algorithms limits the total number of lines of code that can feasibly be analyzed; however, a precise sampling framework has not yet been decided.

Clone analysis algorithms are divided into two classes: metrics-based algorithms and parameterized string-matching algorithms. Metrics-based algorithms attempt to identify clones by measuring various features of the source code and of the abstract syntax tree derived from the source in order to generate fingerprints, which can then be compared. The set of metrics used in any particular algorithm varies, but may include the number of lines of code, cyclomatic complexity, parameter counts, and maximum level of nesting.

Parameterized string-matching algorithms start by normalizing the source code as a string by removing comments and whitespace. Token names in the source code (e.g., types, variables, constants, and function names) are parameterized. For example, the expression a = 2 * b is transformed into P = P * P (Baker, 1992). Exact string-matching algorithms are then used to identify maximal matching substrings within the transformed source code. Matching sequences of sufficient length are then analyzed for a parameterized match, which iteratively replaces the parameterized tokens and compares matching sequences.

The assembled code will be organized by source (project or external resource) and processed using the CCFinder clone analysis toolkit (Kamiya et al., 2002) to identify cloned regions of code. CCFinder uses a parameterized string-matching approach.
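The normalization and exact-matching steps described above can be illustrated with a minimal Python sketch. This is not CCFinder's implementation: the function names parameterize and longest_shared_run, the simplified tokenizer, and the comment patterns are all assumptions made for illustration. The sketch also treats language keywords as ordinary names and omits the final parameterized-match step that checks for consistent renaming.

```python
import re

# Hypothetical, simplified tokenizer (an assumption, not CCFinder's lexer):
# names (optionally PHP-style, with a leading $), numeric literals,
# or any other single non-whitespace character.
TOKEN = re.compile(r"[A-Za-z_$][A-Za-z0-9_]*|\d+(?:\.\d+)?|\S")
# Line comments (// and #) and block comments (/* ... */).
COMMENT = re.compile(r"//[^\n]*|#[^\n]*|/\*.*?\*/", re.DOTALL)


def parameterize(source):
    """Strip comments and whitespace, then replace names and constants with P."""
    stripped = COMMENT.sub(" ", source)
    tokens = []
    for tok in TOKEN.findall(stripped):
        if tok[0].isalnum() or tok[0] in "_$":
            tokens.append("P")  # name or numeric constant
        else:
            tokens.append(tok)  # operator or punctuation, kept as-is
    return tokens


def longest_shared_run(a, b):
    """Length of the longest contiguous token run common to both sequences."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


print(" ".join(parameterize("a = 2 * b")))  # P = P * P
```

A run would be reported as a candidate clone only if its length exceeds a chosen minimum; that threshold is a tuning decision not specified here.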
A source-by-source matrix C will be constructed using the resulting clone counts, where C_{i,j} is the amount of cloning present between the ith and jth sources. The precise formulation of this matrix and the subsequent methods of analysis are open questions on which the researcher is seeking the input of the CSNA community.

Works Cited

Baker, B. S. (1992). A Program for Identifying Duplicated Code. Computing Science and Statistics.
Kamiya, T., Kusumoto, S., Inoue, K. (2002). A multilinguistic token-based code clone detection system for large scale source code. IEEE Transactions on Software Engineering, 28(7):654-670.
Kim, M., Bergman, L., Lau, T., Notkin, D. (2004). An Ethnographic Study of Copy and Paste Programming Practices in OOPL. Proceedings of the 2004 International Symposium on Empirical Software Engineering.
Rosson, M. B., Carroll, J. M. (1993). Active Programming Strategies in Reuse. Proceedings of the 1993 European Conference on Object Oriented Programming.