Balkinization: Toward a Public Alternative in Digital Archiving and Search

Frank Pasquale

With inimitable clarity, Cory Doctorow made the case for an open alternative to Google in The Guardian earlier this month. He focused on the secrecy of search:

[S]earch engines routinely disappear websites for violating unpublished, invisible rules. Many of these sites are spammers, link-farmers, malware sneezers and other gamers of the system. . . . The stakes for search-engine placement are so high that it's inevitable that some people will try anything to get the right placement for their products, services, ideas and agendas. Hence the search engine's prerogative of enforcing the death penalty on sites that undermine the quality of search.

[Nevertheless, i]t's a terrible idea to vest this much power with one company, even one as fun, user-centered and technologically excellent as Google. It's too much power for a handful of companies to wield.

Search engines like Google have some good reasons for keeping their algorithms confidential---if they were public, manipulators could swamp searchers with irrelevant results. However, just as Comcast cannot circumvent net neutrality regulation by saying all its traffic management and spam-fighting methods are trade secrets, search engines should not be able to use such arguments to escape regulation altogether. Moreover, there are ways of developing a qualified transparency that would let a trusted third party examine a search engine's conduct without exposing its business methods for all the world to see.

But Doctorow does not want regulation here---he wants an alternative. Having made a similar case for a "public option" in health insurance, I like this line of argument, but I think Doctorow underestimates the barriers to entry.

Though he's aware of the failure of Wikia, Doctorow wonders if a "wikipedia for search" could be built:

We can imagine a public, open process to write search engine ranking systems, crawlers and the other minutiae. But can an ad-hoc group of net-heads marshall the server resources to store copies of the entire Internet? . . . . It would require vast resources. But it would have one gigantic advantage over the proprietary search engines: rather than relying on weak "security through obscurity" to fight spammers, creeps and parasites, such a system could exploit the powerful principles of peer review that are the gold standard in all other areas of information security.

The "rival public system" approach has been suggested for search engines a few times before. About a decade ago, Introna & Nissenbaum demonstrated that "the conditions needed for a marketplace to function in a 'democratic' and efficient way are simply not met in the case of search engines." Recognizing this, Jean-Noel Jeanneny made a case for a French-language alternative to dominant US-based search engines. The Quaero project in the EU appears to be answering that call, though in a far more dirigiste manner than Doctorow would probably like.

I have a few thoughts on a "public option" in search, building on a talk I gave at Yale Law's Library 2.0 conference in the spring.

First, I think we have to fully understand just how big Google's present operation is. They're using somewhere between 100,000 and a million computers to index the web. Could a volunteer project like SETI@home, or any other distributed computing system, "store" an index of that size across many computers? Indexing the web is a project orders of magnitude more storage- and processing-intensive than hosting an online encyclopedia like Wikipedia, or even hosting the collaborative editing process that is Wikipedia's "secret sauce."

Nevertheless, there are some steps that could lead to an infrastructure for a public option in search. Google's supporters have frequently argued that it needs to scan and store books because they could be lost in disasters. Couldn't a similar case be made that government or an NGO needs to index Google's archive of web pages and books in case, say, a tornado hits a central Google storage facility? At what point does it become critical infrastructure?

Note that there should be a strict separation in such a proposal between information a search engine company properly owns (such as user data patterns, records of how many people clicked on what, etc.) and an underlying collection of materials that would be "archived" as a base of content for the public option. For example, to take one small slice of search, books: I would argue that any settlement of the current lawsuit between Google and publishers should direct the U.S. Copyright Office to require digital deposit of all copyrighted books in the US, as a database for a future public option in search. In antitrust terms, the digitized copies are an "essential facility" for future advances in book search---particularly if the cozy relationship between Google and a books "Registry" envisioned in the current settlement documents is ratified by the courts.

The big question here is whether we want a government entity to do all this archiving for the web generally, or some publicly funded third party. Some might think that the latter entity is a better bet in terms of privacy protections. But the more one understands how flimsy a legal barrier separates government actors from "private" data stores, the less difference it makes whether the database used for the public option is in governmental or NGO hands.

Finally, even if a public alternative in search seems unlikely, I deeply believe we need to guarantee one in book search. Note that in web searches, Google's role is usually only to direct us toward what is most relevant---not to ration access to knowledge, a role it so often plays in book search with snippets, restricted portions, etc. In this new role it is much more like a private health insurer rationing access to care than it is your traditional Web 2.0 info-company organizing access to the web by creatively accessing the wisdom of crowds. It's a middleman, and if we've learned anything from the health care field, it's that highly concentrated provider markets combined with highly concentrated insurer markets lead to ever-higher prices for everyone outside that charmed circle of bilateral monopoly. Here's how Joseph White characterized the developments in health care:

One might wonder why consolidation among insurers did not allow them to resist the providers' demand for increased payments. The simple answer is that there were two concentrated parts of the market and one fragmented part. The insurers had to choose between fighting a full-pitched battle with the providers or exploiting their own market power vis-a-vis employers. Raising premiums to employers was a lot easier.

Substitute "publishers" for "providers," "Google" for "insurers," and "readers" for "employers" in that dynamic, and you have a pretty good sense of how the book search settlement will ultimately play out without some alternative service. Right now, Medicare is the only entity exercising genuine price discipline and providing universal access in the US health field. We need something like it in book search.

PS: I have more thoughts on Doctorow's piece in the comments section of this interesting blog post by Berin Szoka. I really hope Doctorow does not endorse First Amendment protection for whatever dominant search engines do.

X-Posted: Concurring Opinions.

Posted 11:02 PM by Frank Pasquale