View From Sapsucker Woods: Supercomputers Wring Meaning From eBird Data

By John W. Fitzpatrick
July 15, 2011

Revolution is in the air, and is being spurred to dizzying new heights by the Internet. No, I am not referring to the social upheaval taking place in the Middle East, but rather to a quiet revolution much closer to home involving tens of thousands of people and tens of millions of observations. This revolution is being generated right here at Cornell, but it is bound to alter forever the way humans look at animal distributions the world over. Already, it has produced a report heralded as a milestone by the U.S. Secretary of the Interior.

The revolution started quietly a few years ago, as bird watchers began entering their observations directly online via the first generation of eBird. As project leaders and back-end programmers worked furiously to improve data-entry interfaces and provide personalized services, eBird steadily gained converts and the drumbeat quickened. Revolutionary forces began assembling, inspired by a union of visionary statisticians, computer scientists, and ornithologists, and fueled by far-sighted funding from the National Science Foundation and the Leon Levy Foundation. Specialists in “machine learning” embraced the exponentially growing eBird data set as an opportunity to develop and test new data-mining algorithms using a biological system challenged by real-world environmental problems. Hundreds of continent-scale data sets reflecting human and environmental variables were brought into the war room, but the revolution still lacked sufficient firepower to take off: too many data; too many variables; vastly too much surface area across which to carry out repetitive, locally tuned statistical procedures. Finally, about a year ago, a long-awaited breakthrough took place. The true revolution exploded into view over this past winter.

The breakthrough tool is called the TeraGrid, a remarkable integration of high-performance computers, data resources and tools, and high-end experimental facilities around the United States, touted as “the world’s largest, most comprehensive distributed cyberinfrastructure for open scientific research.” Our first effort to crunch the huge eBird data set together with hundreds of other continental data resources required thousands of hours of computing time on the TeraGrid.

The resulting visualizations were breathtaking; we could call them “next-generation distribution maps.” Viewed as static snapshots of distributions at single moments in time (for example, the map at right details the U.S. distribution of Wood Thrushes on June 28, 2009), their detail is stunning. Even more spectacular, for any given species, are 52 weekly maps displayed sequentially to produce spellbinding animations of the entire annual migratory cycle. Never before has such detailed information about animals’ distributions been synthesized and visualized so clearly.

Word of these revolutionary distribution maps spread quickly after a collection of them was posted on eBird. Colleagues who thought they’d known everything about bird distributions were transfixed by surprising new details, even about common birds. Potential applications are endless and are just beginning to be explored. In one of the first, Cornell Lab scientists partnered with colleagues at other conservation organizations and federal agencies to quantify the importance of public lands in protecting birds across the United States. Using a grant of 70,000 hours on the TeraGrid, the team mapped the breeding distributions of 150 bird species at 30 km resolution and analyzed them together with public ownership maps. The resulting report, State of the Birds 2011, clearly reveals the crucial conservation roles played by our national parks, forests, wildlife refuges, grasslands, and Bureau of Land Management (BLM) lands.

The revolution continues to gather strength. During the month of May 2011, eBird logged 3.1 million observations (more than the number logged during its first three years combined). A new grant providing 3 million hours on the TeraGrid will allow us to map all of North America’s breeding birds at 3 km resolution, weekly, from 2007 through 2010. Our views of bird distributions will never be the same again, and we eagerly await the day when enough data are available to make this a worldwide exercise. Viva la revolución!

John W. Fitzpatrick
Louis Agassiz Fuertes Director
Cornell Lab of Ornithology