Wednesday, March 25, 2015

Centralization, Fragmentation, and Replication in the Genomic Data Commons

Guest Blogger

Peter Lee

For the Innovation Law Beyond IP 2 conference, March 28-29 at Yale Law School

Genomics—the study of organisms’ entire genomes—holds great promise to advance biological knowledge and facilitate the development of new diagnostics and therapeutics. Genomics research has benefited greatly from various policies requiring the rapid disclosure of nucleotide sequence data in public databases. The result is a genomic data commons, a widely-accessible repository of information from which all members of the scientific community can draw. Notably, this intensely productive space operates almost completely outside of formal intellectual property law through a combination of public funding, agency policy, and communal norms.

The genomic data commons has attracted significant scholarly interest both because of its great potential to advance biomedical research and because of its broader lessons about the nature of commons-based productivity. For instance, Jorge Contreras has charted the evolution of the genomic data commons from a system that essentially disseminates information into the public domain to a more complex, “polycentric” governance institution for managing knowledge resources. This paper, which grows out of Brett Frischmann, Michael Madison, and Kathy Strandburg’s project to study commons governance, explores less appreciated but highly significant complexities of managing genomic information. In so doing, it seeks to shed greater light on the nature of commons in general.

In particular, this paper focuses on the governance challenges of correcting, updating, and annotating vast amounts of sequence data in the commons. Most legal accounts of the genomic data commons focus on researchers’ initial provisioning of data and access to such data by other scientists. Delving into the science of genome sequencing, assembly, and annotation, however, this paper highlights the indeterminate nature of sequence data and related information. Quite simply, the genomic data commons is full of errors and incompleteness. Accordingly, this paper examines four approaches for correcting, completing, and updating existing data: contributor-centric data management, third-party biocuration, community-based wikification, and specialized databases and genome browsers. It argues that these approaches reveal a deep tension between centralization and fragmentation of control within the genomic data commons, a tension that can be mitigated through a strategy of replication.

On one hand, contributor-centric data management and third-party biocuration represent mechanisms for centralizing control over data. In these models, the original data contributor or database manager has an almost exclusive ability to update existing records. On the other hand, wiki-based annotation fragments control throughout the community, exploiting the power of peer production and parallel data analysis to augment existing data records. Both centralization and fragmentation have their pros and cons, and this paper argues that stakeholders can capture the best of both worlds by exploiting the nonrivalry of information. In particular, researchers are engaged in a strategy of replication, employing specialized databases and genome browsers that combine centralized, archival data and widespread community input to provide more textured, value-added renderings of genomic information. Among other advantages, this approach has important epistemological implications, as it both reflects and reveals that genomic knowledge is the product of social consensus.

More broadly, this study reveals that the genomic data commons is both less and more of a commons than previously thought. On one hand, it features a highly centralized data architecture. The efforts of thousands of genomic researchers around the world feed into a consortium of three publicly sponsored databases, which members of the community may not modify directly. On the other hand, this knowledge system represents a set of commons on top of a commons. At one level, it’s an archival data repository emerging from a global community of scientists. At another level, however, the genomic data commons also encompasses many sub-communities (often organized around model organisms) that develop their own specialized databases and nomenclatures. Additionally, user groups develop meta-tools such as genome browsers and freely distribute them throughout the community, thus helping to make genomic data more intelligible.

Furthermore, this study highlights the strong role of centralization and standardization in the effective operation of a commons. The commons is often perceived as an open space free of government intervention and insulated from market demands. Indeed, the genomic data commons has been structured quite conscientiously to operate outside of the legal and economic influence of patents. However, the genomic data commons underscores that commons-based productivity systems are not simply “free-for-alls” lacking order or regulation. Too much control, and the power of sharing, parallel processing, and peer production goes unrealized. Too little control, however, and the commons just dissipates into chaos and entropy. Truly effective commons strike a balance between centralization and fragmentation.

Peter Lee is Professor of Law and Chancellor's Fellow at the University of California, Davis, School of Law. He can be reached at ptrlee at

