In our last blog post we introduced you to Madison, a crowdsourcing project where readers provide data on the historical ads within our archives. Madison required a system that could keep track of our tasks, users and candidate ads; combine the three to create assignments, based on which ads and users are eligible for that task; and validate the crowdsourced data we receive according to custom criteria.

Madison’s planning stages involved a review of the current landscape of crowdsourcing solutions. We looked at existing hosted and open source platforms and libraries, paying special attention to ProPublica’s Transcribable Rails plugin, Zooniverse’s Scribe and PyBossa. While these projects do an excellent job handling certain use cases, none of them could support what we envisioned for Madison. Further, we realized that the kind of workflow we were designing could be abstracted to support a wider variety of crowdsourcing requirements than just the assets within our archives. We were intrigued by the idea of designing a service that defined the pieces of a crowdsourcing process in a highly flexible, customizable and modular way, and how such a platform could expand the set of things available to crowdsource.

The system we built is Hive, an open-source platform that lets developers produce crowdsourcing applications for a variety of contexts. Informed by our work on Streamtools, Hive’s technical architecture takes advantage of Go’s efficiency in parsing and transmitting JSON along with its straightforward interface to Elasticsearch. Combining the speed of a compiled language with the flexibility of a search engine means Hive is able to handle a wide variety of user-submitted contributions on diverse sets of tasks.

The workflow looks like this:

[Figure: Hive Workflow]

As you can see, there is a loop forming the core of Hive: a particular task is repeatedly done on an asset until accurate data is received, at which point the asset (now with new information on it) can reenter the pipeline with a different, newly available task. (For example, in Madison, an asset can only enter the Tag phase once it has been verified during the Find phase as having “exactly one ad.”) This frees the developer from having to define every possible combination of assets and tasks. It also means that more tasks can be added as the developers get a clearer view of the type of data they’re dealing with (e.g. if an asset is marked as “multiple ads,” we could, in the future, implement a Cut task that is solely available for those assets). This loop lets developers gather increasingly granular information about a particular asset and, therefore, ask better questions for their audience to answer.
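To make the loop concrete, here is a minimal Go sketch of the two decisions it depends on: whether an asset is eligible for a task, and whether the submissions received so far are enough to verify it. The `Asset`, `Task`, `Eligible` and `Verify` names are hypothetical and are not taken from Hive's code, and the agreement threshold is only one assumed example of a custom validation rule.

```go
package main

import "fmt"

// Asset is a hypothetical record for an item being crowdsourced.
// Verified maps a task name to the value the crowd agreed on.
type Asset struct {
	ID       string
	Verified map[string]string
}

// Task is a hypothetical task definition. Requires lists the verified
// values an asset must already carry before this task applies to it.
type Task struct {
	Name     string
	Requires map[string]string
}

// Eligible reports whether the asset still needs this task and meets
// all of the task's prerequisites.
func Eligible(a Asset, t Task) bool {
	if _, done := a.Verified[t.Name]; done {
		return false
	}
	for task, want := range t.Requires {
		if a.Verified[task] != want {
			return false
		}
	}
	return true
}

// Verify returns the first value that reaches `threshold` matching
// submissions. This is an assumed rule, not Hive's actual criteria.
func Verify(submissions []string, threshold int) (string, bool) {
	counts := make(map[string]int)
	for _, s := range submissions {
		counts[s]++
		if counts[s] >= threshold {
			return s, true
		}
	}
	return "", false
}

func main() {
	find := Task{Name: "find"}
	tag := Task{Name: "tag", Requires: map[string]string{"find": "exactly one ad"}}
	asset := Asset{ID: "asset-1", Verified: map[string]string{}}

	fmt.Println(Eligible(asset, find), Eligible(asset, tag)) // true false

	subs := []string{"exactly one ad", "multiple ads", "exactly one ad"}
	if value, ok := Verify(subs, 2); ok {
		asset.Verified[find.Name] = value
	}
	fmt.Println(Eligible(asset, find), Eligible(asset, tag)) // false true
}
```

In this sketch, a future Cut task would simply be another `Task` whose `Requires` map asks for a Find result of “multiple ads,” with no change to the assets themselves.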

Of course, no crowdsourcing platform could be successful without the crowd. Hive has an intentionally flexible definition of a user: you may require a signup and login process, or simply allow anonymous contributions to lower the barrier to entry. Hive keeps track of each user’s number of contributions by task, both as a total and broken down by how many were skipped, completed or verified. The platform also supports asset favoriting, a feature we use on Madison’s profile pages.
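As a rough illustration of that bookkeeping, the Go sketch below tallies contributions per task and tracks favorites. The `User`, `Counts` and `Status` types are hypothetical, and the sketch assumes each contribution has exactly one of the three outcomes, which may not match how Hive actually counts them.

```go
package main

import "fmt"

// Status names the three outcomes mentioned in the text.
type Status string

const (
	Skipped   Status = "skipped"
	Completed Status = "completed"
	Verified  Status = "verified"
)

// Counts is a hypothetical per-task tally for one user.
type Counts struct {
	Total     int
	Skipped   int
	Completed int
	Verified  int
}

// User is a hypothetical user record. Anonymous marks a contributor
// who never signed up.
type User struct {
	ID        string
	Anonymous bool
	Counts    map[string]*Counts // keyed by task name
	Favorites map[string]bool    // keyed by asset ID
}

// NewUser returns a user with empty tallies and favorites.
func NewUser(id string) *User {
	return &User{
		ID:        id,
		Counts:    make(map[string]*Counts),
		Favorites: make(map[string]bool),
	}
}

// Record adds one contribution with the given outcome to a task's tally.
func (u *User) Record(task string, s Status) {
	c, ok := u.Counts[task]
	if !ok {
		c = &Counts{}
		u.Counts[task] = c
	}
	c.Total++
	switch s {
	case Skipped:
		c.Skipped++
	case Completed:
		c.Completed++
	case Verified:
		c.Verified++
	}
}

// Favorite marks an asset as a favorite of this user.
func (u *User) Favorite(assetID string) {
	u.Favorites[assetID] = true
}

func main() {
	u := NewUser("u1")
	u.Record("find", Completed)
	u.Record("find", Skipped)
	u.Record("find", Verified)
	u.Favorite("asset-1")
	fmt.Printf("%+v\n", *u.Counts["find"]) // {Total:3 Skipped:1 Completed:1 Verified:1}
}
```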

Hive’s technology was chosen with an eye towards supporting an iterative design process without sacrificing performance. Elasticsearch, “a swiss army knife” of a data layer, seemed an obvious choice: adjustable enough to store structured and schemaless documents, indexed almost instantly and retrievable through a RESTful API with all the full-text search power of Apache Lucene. Hive uses the elastigo library to manage all the data associated with crowdsourcing applications, from project configuration to user submissions and statistics, in Elasticsearch.
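One way to picture “structured and schemaless” together is a document with a few fixed fields and a free-form payload that differs from task to task. The Go sketch below only builds the JSON body such a document might have; the `Submission` shape and field names are assumptions, and it deliberately makes no elastigo or Elasticsearch calls.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Submission is a hypothetical document shape: fixed fields that every
// task shares, plus a free-form Data payload that varies by task.
type Submission struct {
	User  string                 `json:"user"`
	Task  string                 `json:"task"`
	Asset string                 `json:"asset"`
	Data  map[string]interface{} `json:"data"`
}

// Encode serializes a submission to the JSON that a data layer such as
// Elasticsearch could store as a single document.
func Encode(s Submission) (string, error) {
	b, err := json.Marshal(s)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	s := Submission{
		User:  "u1",
		Task:  "find",
		Asset: "asset-1",
		Data:  map[string]interface{}{"answer": "exactly one ad"},
	}
	doc, err := Encode(s)
	if err != nil {
		panic(err)
	}
	fmt.Println(doc)
	// {"user":"u1","task":"find","asset":"asset-1","data":{"answer":"exactly one ad"}}
}
```

Because `Data` is an open map, a new task can start storing different fields without any change to the struct.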

All functionality in Hive is surfaced through API endpoints, which are easy to integrate with a web site through simple JavaScript calls. Decoupling the crowdsourcing functionality from the front-end means you can focus on design and user experience instead of data wrangling.
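For a sense of what that decoupling looks like on the server side, here is a small Go sketch of a JSON endpoint built with the standard `net/http` package. The `/assignment` route, the `user` query parameter and the `Assignment` response shape are all hypothetical and are not Hive's actual API; a real service would look up an eligible asset rather than return a fixed one.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// Assignment is a hypothetical response pairing a user, a task and an asset.
type Assignment struct {
	User  string `json:"user"`
	Task  string `json:"task"`
	Asset string `json:"asset"`
}

// assignmentHandler replies with a hardcoded assignment as JSON.
func assignmentHandler(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodGet {
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	a := Assignment{User: r.URL.Query().Get("user"), Task: "find", Asset: "asset-1"}
	w.Header().Set("Content-Type", "application/json")
	if err := json.NewEncoder(w).Encode(a); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
	}
}

// newMux wires the hypothetical route to its handler.
func newMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/assignment", assignmentHandler)
	return mux
}

// fetch calls the handler in-process, standing in for a front-end
// request, and decodes the JSON reply.
func fetch(h http.Handler, user string) (Assignment, int, error) {
	target := "/assignment?user=" + url.QueryEscape(user)
	req := httptest.NewRequest(http.MethodGet, target, nil)
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	var a Assignment
	err := json.NewDecoder(rec.Body).Decode(&a)
	return a, rec.Code, err
}

func main() {
	a, code, err := fetch(newMux(), "anonymous")
	if err != nil {
		panic(err)
	}
	fmt.Println(code, a.Task, a.Asset, a.User) // 200 find asset-1 anonymous
}
```

The front-end only ever sees the JSON, so the page design can change freely without touching the code behind the endpoint.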

Hive speaks to the Lab’s overall goal of providing new methods for receiving, storing, and understanding the sources of data around us. The past two months of dedicated and active contributors on Madison show that Hive can not only support a wide variety of crowdsourcing tasks, but also do so under high traffic.

Hive is free and open source, and we look forward to seeing the diverse crowdsourcing projects it powers in the future.