WILDLABS Virtual Meetup Recording: Big Data in Conservation

The WILDLABS Virtual Meetup Series is a program of webinars that bring leading engineers in the tech sector together with conservation practitioners to share information, identify obstacles, and discuss how to best move forward. 

Our third virtual meetup on Big Data in Conservation is now available to watch, along with notes that highlight the key takeaways from the talks and discussion. In the session, speakers Dave Thau, Dan Morris, and Sarah Davidson shared their work in 10-minute presentations, followed by a lively open discussion and community exchange.

Date published: 2018/11/27

Overview

The WILDLABS Virtual Meetup Series is a program of webinars for community members and wider partners to discuss emerging topics in conservation technology and leverage existing community groups for virtual exchange. The aim of the series is to bring leading engineers in the tech sector together with conservation practitioners to share information, identify obstacles, and discuss how to best move forward.

The series began in late 2018 and will continue in 2019, hosted on WILDLABS via Zoom. The three topics we've chosen to cover in 2018 are Networked Sensors for Security and Human-Wildlife Conflict (HWC) Prevention, Next-Generation Wildlife Tracking, and Big Data in Conservation. The first two topics centered on data collection – on wildlife populations through tracking, and on protected areas and community boundaries through networked sensors – while this third topic tackles how to most effectively utilize that data.

There is a lively discussion about possible topics members would like to have space to discuss, so if you have ideas for future meetups please join the thread and share your thoughts. 

Meetup 3: Big Data in Conservation

Date & Time

Wednesday, December 12th

Main Talks: 3:00-4:00pm GMT / 10:00-11:00am EST

Additional half hour for discussion: 4:00-4:30pm GMT / 11:00-11:30am EST

Background & Need

With new technologies revolutionizing data collection, wildlife researchers can now gather data at far higher volumes than ever before. Now we are facing the challenges of putting this information to use, bringing the science of big data into the conservation arena. With the help of machine learning tools, this area holds immense potential for conservation practice. The applications range from online trafficking alerts to species-specific early warning systems to efficient movement and biodiversity monitoring and beyond.

However, the process of building effective machine learning tools depends upon large amounts of standardized training data, and conservationists currently lack an established system for standardization. Therefore, how to best develop such a system and incentivize data sharing are questions at the forefront of this work. There are currently multiple AI-based conservation initiatives, including Wildlife Insights and WildBook, that are pioneering applications on this front. Building upon our two previous virtual meetups, as well as recent conversations taking place within the broader conservation tech community, this discussion will address current efforts, illustrate how they fit together, and frame them within these broader questions about the future of big data in conservation.

Outcomes

The aims of this discussion are as follows: to introduce the technologies used for processing big data in the context of conservation; to describe how they are being used for conservation, including what needs they are addressing in conservation practice and how different approaches fit together; to identify the obstacles in advancing the capacity of these technologies from both field and tech perspectives; and to discuss the future of big data tech, including the sustainability of its applications and how best to collaborate moving forward.

Agenda

  • Welcome and introductions (5 min)

  • Dave Thau, Data and Technology Global Lead Scientist at WWF-US (10 min)

  • Dan Morris, Principal Researcher, Microsoft - AI for Earth (10 min)

  • Sarah Davidson, Data Curator at Movebank (10 min)

  • Q&A discussion with speakers (20 min)

  • Optional ongoing discussion and community exchange (30 min)

  • Takeaways and wrap up (5 min)

Recording

Big Data Meetup Link to Video Recording

Click through here to watch the full meetup.

Virtual Meetup Notes

For our final WILDLABS Virtual Meetup of 2018, we were joined by over 100 attendees from all over the world! Thanks to everyone who participated in the live chat and Q&A, and especially to those of you who were able to stay on for the discussion at the end. There were so many great questions that even that extra half hour didn't let us cover them all. It has been wonderful to see so much enthusiasm around these meetups, and we look forward to continuing the series in the new year. For those of you who were unable to join live, we've recorded the session so that you may view it at your convenience. We've shared key takeaways in the notes below.

Speaker: Dave Thau

Background

  • The Cambrian Explosion… of Data: a massive increase in data over the past decade – expected to reach up to 175,000 exabytes (one exabyte is a billion GB) by 2025 
  • Three aspects of managing big data
    • Volume: amount
    • Velocity: the speed at which they come in
    • Variety: different kinds of information we’re dealing with

Big Data applications in conservation

Volume

  • Large volumes through satellite data – of the 5,000 satellites orbiting the planet, 2,000 are active (the other 3,000 are essentially junk); of these 2,000, about 600 are Earth observation satellites
  • How to manage it
    • Google Earth Engine takes publicly available geospatial data and aggregates it onto Google servers for analysis
      • Initial run numbers: 29 years/909 terabytes of data took 2 million CPU hours to compute – i.e. it would have taken centuries on one computer, but running in parallel over 66,000 computers it took only about a day and a half
    • Global Forest Watch takes that data and makes it easy for users to analyze it however they want – making this available has had real conservation impacts (e.g. the Philippines House of Representatives approved a new bill prohibiting destruction of mangroves; deforestation was revealed in the Peruvian Amazon)
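The parallel-computing figures Dave quoted for Earth Engine's initial run are easy to sanity-check with a little arithmetic:

```python
# Sanity check of the Earth Engine figures quoted above (illustrative arithmetic only).
cpu_hours = 2_000_000        # total compute for the 29-year analysis
machines = 66_000            # computers running in parallel

hours_per_machine = cpu_hours / machines
days_parallel = hours_per_machine / 24
years_serial = cpu_hours / 24 / 365  # the same job on a single computer

print(f"parallel: ~{hours_per_machine:.0f} hours (~{days_parallel:.1f} days)")
print(f"serial:   ~{years_serial:.0f} years")
```

Roughly 30 hours per machine, or about a day and a half in parallel – versus well over two centuries on a single computer, matching the figures above.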

Variety

  • Species observation data – aggregated by Global Biodiversity Information Facility (1 billion records)
    • This data is being combined with other types of data, like species range information, topological data, etc., to inform species health assessments
      • E.g. Map of Life providing reports to UN on how well protected different species are (44,000 species)
  • Camera trap data – aggregated by Wildlife Insights (set to launch in 2019), using machine learning to classify images as they are uploaded
  • Environmental DNA data – species presence by DNA in environment – e.g. WWF eDNA project using polar bear footprints

Velocity

  • The speed of reporting is increasing as a result of faster data intake and artificial intelligence tools for analysis
    • Global Forest Watch has sped up from annual reports to weekly reports by doing rapid analysis of satellite data as it comes in
    • Global Fishing Watch aggregates AIS data (mandated location signal for large fishing vessels) – 22 million data points a day, analyzed using machine learning, sends out alerts when vessels are fishing in protected areas
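As an illustration of the kind of geofence check such an alerting system performs (a minimal sketch, not Global Fishing Watch's actual code), here is a ray-casting point-in-polygon test applied to made-up vessel positions and a made-up protected-area polygon:

```python
# Illustrative sketch: flag AIS positions that fall inside a protected-area
# polygon using the ray-casting point-in-polygon test.

def point_in_polygon(lon, lat, polygon):
    """Return True if (lon, lat) lies inside polygon [(lon, lat), ...]."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from the point cross this edge?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

# Hypothetical protected area (a simple square) and two vessel position fixes.
reserve = [(-10.0, 0.0), (-10.0, 5.0), (-5.0, 5.0), (-5.0, 0.0)]
fixes = [("vessel-A", -7.5, 2.5), ("vessel-B", -12.0, 2.5)]

alerts = [v for v, lon, lat in fixes if point_in_polygon(lon, lat, reserve)]
print(alerts)  # vessel-A is inside the reserve; vessel-B is not
```

A production system would of course also need to filter by vessel activity (fishing vs. transiting), which is where the machine learning analysis of movement patterns comes in.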

Future Directions

  • Real-time data increasing – IDC estimates that in 2025 30% of data will be real-time
  • Moving from monitoring to actionable prediction – valuable in a conservation context to be able to identify future problems and address them early
  • New satellite-based, drone-based, and land-based sensors will keep us busy dealing with new data and figuring out how to work well across multiple sensors

Speaker: Sarah Davidson

Background

  • Bio-logging data: collected by sensors on animals
    • Gather information on location, behavior, health, external conditions, etc. over time – lots of data!  
    • Currently managed locally or in many shared online databases
  • Movebank
    • Open online platform hosted by Max Planck Institute for Ornithology
    • Global database for bio-logging data
    • Tool for working with data throughout its lifecycle (collect, manage, analyze, archive, re-use)
    • Steady growth of database content in users and taxa since launch in 2012, massive leaps in locations and types of sensors (particularly high-resolution data)

Bio-Logging Data Tools for Conservation Applications

  • Low-cost tools – Movebank is free to users, supported by a range of grants and institutions 
  • Access restrictions and security – users retain ownership on Movebank, allowing them to share what they want on the level they specify, ranging from completely private to completely public
  • Features for wildlife management – real-time data is vital in many cases, as we need to be able to respond in the field
    • Automated data feeds for 15 tag companies – you can subscribe to get your data brought in multiple times a day and fed into your database for you to view and manage in real-time
    • Software and API
      • Easy download and access to EnvDATA (environmental data annotation tool)
      • Animal Tracker App
      • R packages “move” and “ctmm” let you pull your data directly from Movebank and analyze in R, don’t need to store data locally
  • Support for transboundary and multi-institution projects
    • E.g. Animals on the Move project (part of NASA funded Arctic Boreal Vulnerability Experiment) – 50 animal movement datasets shared by ~30 collaborating institutions
      • Maintain controlled access but keep all data in the same format on the same platform
  • Standards for discovery and integration – how to support beneficial use of sensitive data over time?
    • Lack of community-wide standards for bio-logging data
      • International Bio-Logging Society Data Standards Working Group is focusing on this
    • Long-term persistence – what’s our long-term plan for maintaining our ability to find and access data over time?
      • For non-sensitive data, Movebank will review and publish datasets with a DOI to cite and maintain access, but this makes up a small percentage of the data they host
      • For sensitive and non-public data, we risk losing access if permissions are limited to a few people and we lose contact with them, etc. – what’s our responsibility to future generations in terms of making sure this data is available for them to inform conservation decisions?

Speaker: Dan Morris

Background: Microsoft AI for Earth 

Making cool stuff happen at the intersection of machine learning and environmental sustainability

  • Grants program that helps other people do this work – cloud compute credits and funding
  • Microsoft building tools to solve sustainability problems

Computer vision challenges in wildlife conservation

Species classification from handheld photos (taken by a person with a camera, rather than by a camera trap)

  • Species Classification API Demo
    • This is a specific model they’ve built to classify species in images – you can upload your own photos to this demo and experiment with it (it even classified Dan’s dog and his son’s stuffed animals correctly!)
    • The infrastructure this web app sits on is the same set of tools they provide AI for Earth Azure grantees to build apps just like this one on top of their own work
    • API details
      • InceptionV4/InceptionResNetV2 ensemble
      • Trained in PyTorch on an Azure NC12v3 DSVM – ~3 weeks of training on 2 GPUs
      • Train-time augmentation: scale, flip, color, crop
      • 80%/95% top-1/top-5 validation accuracy on ~5k classes
      • Hosted on the AI4E backend
  • How can you get involved?
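The "top-1/top-5" figures mean the correct species is the model's single highest-scoring guess 80% of the time, and is among its five highest-scoring guesses 95% of the time. A toy illustration of how top-k accuracy is computed for one image (scores and species are made up):

```python
# Toy illustration of top-1 / top-5 accuracy (made-up scores, not the real model).

def top_k_correct(scores, true_label, k):
    """True if true_label is among the k highest-scoring classes."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return true_label in ranked[:k]

# One image's class scores over a tiny six-class "species" set.
scores = {"zebra": 0.05, "impala": 0.30, "kudu": 0.25,
          "warthog": 0.20, "eland": 0.15, "dik-dik": 0.05}
true_label = "kudu"

print(top_k_correct(scores, true_label, 1))  # False: "impala" ranks first
print(top_k_correct(scores, true_label, 5))  # True: "kudu" ranks second
```

Averaging these booleans over a validation set gives the reported top-1 and top-5 accuracy figures.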

Automating camera trap image processing

  • Huge numbers of “empty” or false-trigger images – the biggest issue with machine learning for camera trap data is getting good labeled data to train on
  • To help with this, AI for Earth partnered with Zooniverse and the University of Wyoming to set up an open data repository at http://lila.science (getting open data out there for machine learning people to work with is what’s important right now!)
    • Adding value - Microsoft is manually annotating uploaded data by adding bounding boxes around animals, which makes it easier to train detectors
    • Detector training – focusing on training detectors (is there something in this image?) rather than classifiers (what is the thing in this image?) in order to:
      • Get rid of empty images right away
      • Increase generalizability (detectors work well in ecosystems all over the world)
      • Simplify classification by cropping images once you know where the animal is
    • Detector training details: trained in TensorFlow on an NC12v3 DSVM; Faster R-CNN with an InceptionResNetV2 base, pre-trained on COCO
    • Camera Trap Demo – preliminary API to host this detector and let people run it on their own images
  • How can you get involved?
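The detector-first workflow above – discard empties, crop, then classify – can be sketched in a few lines. The `detect`, `crop`, and `classify` functions here are hypothetical stubs standing in for the actual Faster R-CNN detector and species classifier:

```python
# Sketch of a detector-then-classifier pipeline for camera trap images.
# detect(), crop(), and classify() are stubs, not the real models.

def detect(image):
    """Return bounding boxes (x, y, w, h) for animals, or [] if empty."""
    return image.get("boxes", [])

def crop(image, box):
    """Return the image region inside the box (stubbed as a dict)."""
    return {"parent": image["name"], "box": box}

def classify(cropped):
    """Return a species label for a cropped animal (stubbed from the filename)."""
    return cropped["parent"].split("_")[0]  # e.g. "zebra_001" -> "zebra"

def process(images):
    labels = []
    for image in images:
        boxes = detect(image)
        if not boxes:            # discard empty / false-trigger images right away
            continue
        for box in boxes:        # classify each cropped animal, not the full frame
            labels.append(classify(crop(image, box)))
    return labels

# Two triggers: one with an animal, one empty (wind-blown grass, say).
batch = [{"name": "zebra_001", "boxes": [(10, 20, 50, 40)]},
         {"name": "empty_002", "boxes": []}]
print(process(batch))  # ['zebra']
```

The structure mirrors the rationale above: the detector filters out empty frames before any classifier runs, and cropping to the box means the classifier only ever sees the animal.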

Feedback

If you attended or have watched the recording, please take this quick survey to give us feedback so that we can improve future events.

Is there another topic you'd like to have covered in this series? Join the discussion to help shape future events.