Birdhouse 🐦

The Eurasian Blue Tit by Bradley Hacker · 21 Mar 2017 · Lynford Arboretum, Norfolk, England, United Kingdom

Birdhouse is a command-line tool written in Go for building customised, high-quality image datasets of birds from eBird.org.

Whether you are building a machine learning classifier, doing ecological research, or just creating a personal collection, Birdhouse handles bulk metadata fetching, dataset balancing (min/max samples), multithreaded downloading, rate-limiting, and manifest generation.


🌟 Features

  • Multithreaded Downloading: Quickly download thousands of images using customisable worker threads.
  • Quality Control: Filter images by minimum community rating and minimum number of reviews.
  • Dataset Balancing: Enforce strict minimum and maximum samples per bird class to ensure an evenly distributed dataset.
  • Resilience & Resuming: Automatically handles eBird rate limits (429/403) by pausing and retrying. Safely interrupt the program with Ctrl+C and resume later without losing progress.
  • Auto-Generated Manifests: Automatically generates a tidy manifest.csv mapping catalog IDs to relative file paths, taxon codes, and common names.

🚀 Installation

Ensure you have Go installed. The project targets Go 1.26+; it relies on sync.WaitGroup.Go, which entered the standard library in Go 1.25.

# Clone the repository
git clone https://github.com/rossheat/birdhouse.git
cd birdhouse

# Build the binary (optional, you can also just use `go run .`)
go build -o birdhouse .

🛠️ Usage

Here is a full example of running Birdhouse to download a dataset of birds found in Great Britain (GB):

go run . \
  --taxon-codes-file="./taxon-codes.txt" \
  --region="GB" \
  --max-samples-per-class=200 \
  --min-samples-per-class=200 \
  --ebird-session-cookie="eyJ1..." \
  --ebird-session-sig="XnWn..." \
  --min-reviews=2 \
  --min-avg-rating=4 \
  --image-download-threads=15

Resuming a Session

If you quit with Ctrl+C or your connection drops, you can resume the same session by adding the --resume-session flag with its Session ID (the name of the timestamped folder under ~/.birdhouse):

go run . \
  --taxon-codes-file="./taxon-codes.txt" \
  --region="GB" \
  --max-samples-per-class=200 \
  --min-samples-per-class=200 \
  --ebird-session-cookie="..." \
  --ebird-session-sig="..." \
  --resume-session="2026-02-27_11-46-33.592"

⚙️ CLI Arguments (Inputs)

| Flag | Description | Default | Required? |
| --- | --- | --- | --- |
| --taxon-codes-file | Path to a text file containing eBird taxon codes (one per line). | - | Yes |
| --region | eBird region code (e.g., GB for Great Britain, US-CA for California). | - | Yes |
| --min-samples-per-class | Minimum images required per class (taxon). Taxa not meeting this are skipped. | - | Yes |
| --max-samples-per-class | Maximum images to download per class (Max: 10000). | - | Yes |
| --ebird-session-cookie | Your eBird ml-search-session cookie value. | - | Yes |
| --ebird-session-sig | Your eBird ml-search-session.sig cookie signature value. | - | Yes |
| --min-avg-rating | Minimum average community rating an image must have (0.0 to 5.0). | 4.0 | No |
| --min-reviews | Minimum number of ratings/reviews an image must have. | 2 | No |
| --image-download-threads | Number of concurrent threads for downloading images (1 to 50). | 15 | No |
| --resume-session | ID of a previous session to resume (e.g., 2024-01-02_15-04-05.000). | - | No |
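
To make the defaults and ranges above concrete, the following sketch parses a subset of these flags with Go's standard `flag` package and checks the documented limits. It is an illustration only: `options` and `parseArgs` are hypothetical names, the rule that the minimum must not exceed the maximum is an assumption, and the real tool may validate differently.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"io"
)

// options holds a subset of the flags listed in the table.
type options struct {
	Region       string
	MinSamples   int
	MaxSamples   int
	MinAvgRating float64
	MinReviews   int
	Threads      int
}

// parseArgs parses args and applies the documented defaults and ranges.
// The lower bounds of 1 and the min <= max rule are assumptions.
func parseArgs(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("birdhouse", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep the sketch quiet on parse errors
	fs.StringVar(&o.Region, "region", "", "eBird region code")
	fs.IntVar(&o.MinSamples, "min-samples-per-class", 0, "minimum images per class")
	fs.IntVar(&o.MaxSamples, "max-samples-per-class", 0, "maximum images per class")
	fs.Float64Var(&o.MinAvgRating, "min-avg-rating", 4.0, "minimum average rating")
	fs.IntVar(&o.MinReviews, "min-reviews", 2, "minimum number of reviews")
	fs.IntVar(&o.Threads, "image-download-threads", 15, "download threads")
	if err := fs.Parse(args); err != nil {
		return o, err
	}
	switch {
	case o.Region == "":
		return o, errors.New("--region is required")
	case o.MaxSamples < 1 || o.MaxSamples > 10000:
		return o, errors.New("--max-samples-per-class must be between 1 and 10000")
	case o.MinSamples < 1 || o.MinSamples > o.MaxSamples:
		return o, errors.New("--min-samples-per-class must be between 1 and the maximum")
	case o.MinAvgRating < 0 || o.MinAvgRating > 5:
		return o, errors.New("--min-avg-rating must be between 0.0 and 5.0")
	case o.Threads < 1 || o.Threads > 50:
		return o, errors.New("--image-download-threads must be between 1 and 50")
	}
	return o, nil
}

func main() {
	o, err := parseArgs([]string{
		"--region=GB",
		"--min-samples-per-class=200",
		"--max-samples-per-class=200",
	})
	fmt.Printf("%+v err=%v\n", o, err)
}
```

Go's `flag` package treats `-name` and `--name` as equivalent, so the double-dash style used throughout this README parses as written.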

🔑 How to Get Required Arguments

1. Taxon Codes File (--taxon-codes-file)

This is a simple text file (.txt) containing the eBird species codes you want to scrape, one code per line. To find a taxon code, search for a bird on eBird.org. The code is the short string in the URL. Example: URL is https://ebird.org/species/mallar3 ➞ Taxon code is mallar3.

Example taxon-codes.txt:

aquwar1
parjae
arcter
pieavo1

2. eBird Session Cookies (--ebird-session-cookie & --ebird-session-sig)

eBird's media API requires authentication via cookies. To get these:

  1. Log in to eBird.org in your web browser.
  2. Open the Network Tab in your browser's Developer Tools (Press F12, Ctrl+Shift+I, or Cmd+Option+I).
  3. Navigate to https://media.ebird.org/catalog.
  4. Expand the first logged request.
  5. In the request's Cookies find ml-search-session. Copy its value and pass it to --ebird-session-cookie.
  6. In the request's Cookies find ml-search-session.sig. Copy its value and pass it to --ebird-session-sig.

(Note: Keep these cookies secure, as they act as your active login session).
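
For readers curious how the two values are used, the sketch below attaches them to an HTTP request as cookies with Go's `net/http`. The cookie names come from the flag descriptions above; `newCatalogRequest` is a hypothetical helper, and the URL is simply the catalog page from step 3, not necessarily the endpoint Birdhouse calls.

```go
package main

import (
	"fmt"
	"net/http"
)

// newCatalogRequest builds a GET request carrying the two session cookies.
// The request is only constructed here, never sent.
func newCatalogRequest(rawURL, cookie, sig string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, rawURL, nil)
	if err != nil {
		return nil, err
	}
	req.AddCookie(&http.Cookie{Name: "ml-search-session", Value: cookie})
	req.AddCookie(&http.Cookie{Name: "ml-search-session.sig", Value: sig})
	return req, nil
}

func main() {
	// Placeholder values; never hard-code real session cookies.
	req, err := newCatalogRequest("https://media.ebird.org/catalog", "COOKIE_VALUE", "SIG_VALUE")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Header.Get("Cookie"))
}
```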


📁 Output Structure

Birdhouse writes all data to a hidden .birdhouse folder in your user's Home Directory (~/.birdhouse). Every time you run the tool without --resume-session, it creates a new timestamped Session ID folder.

~/.birdhouse/
└── 2026-02-27_11-46-33.592/            # Session ID Directory
    ├── metadata/                       # Raw CSV metadata filtered from API
    │   ├── aquwar1-GB.csv
    │   ├── arcter-GB.csv
    │   └── ...
    ├── images/                         # Downloaded .jpg files grouped by taxon
    │   ├── aquwar1/
    │   │   ├── 123456781.jpg
    │   │   └── 123456782.jpg
    │   ├── arcter/
    │   └── ...
    ├── manifest.csv                    # Final compiled dataset manifest
    └── failed.log                      # (Optional) Log of catalog IDs that failed to download

The Manifest File (manifest.csv)

Upon successful completion (or graceful exit), Birdhouse generates a clean manifest.csv file mapping the local images to their metadata. This is perfect for feeding directly into PyTorch, TensorFlow, or Hugging Face datasets.

Example manifest.csv:

| Catalog ID | Relative Image Path | Taxon Code | Common Name |
| --- | --- | --- | --- |
| 14592834 | images/arcter/14592834.jpg | arcter | Arctic Tern |
| 94832011 | images/mallar3/94832011.jpg | mallar3 | Mallard |

🛑 Handling Rate Limits & Interruptions

If Birdhouse downloads images too quickly, Cornell's Macaulay Library may issue a 429 Too Many Requests or 403 Forbidden response. Birdhouse detects this automatically and will pause for 5 minutes before resuming.

If you want to stop the tool at any point, simply press Ctrl+C. Birdhouse will intercept the signal, finish saving the current batch of images, write any failed IDs to failed.log, and exit cleanly. You can easily pick up right where you left off later using the --resume-session flag.


⚠️ Disclaimer & Terms of Use

Birdhouse interacts with the eBird API and the Macaulay Library to download media. Please be aware of the following:

  • Terms of Service: You are responsible for ensuring your usage complies with the Macaulay Library Terms of Use and eBird's API Terms of Service.
  • Rate Limits: Do not abuse the API. The tool includes built-in rate-limit handling, but setting --image-download-threads too high may result in your IP or account being temporarily or permanently blocked by the Cornell Lab of Ornithology.
  • Non-Commercial Use: Data downloaded from the Macaulay Library is typically for personal, non-commercial, or academic/research purposes. Always double-check licensing before using these datasets in commercial machine learning models.

🛠️ Troubleshooting

"unexpected status 401/403 for [taxon-code]" or "parsing csv... EOF"

  • Cause: Your eBird session cookies have likely expired or are incorrect.
  • Fix: Log out of eBird, log back in, and grab fresh values for --ebird-session-cookie and --ebird-session-sig.

"no classes met the minimum sample threshold... after download"

  • Cause: You set --min-samples-per-class too high, and after filtering for rating/reviews and downloading, no classes had enough images to meet your requirement.
  • Fix: Lower the --min-samples-per-class value, lower the --min-avg-rating, or reduce the --min-reviews requirement to broaden the pool of accepted images.
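
To see how the three settings interact, here is a minimal sketch of per-class selection: filter by rating and review count, drop the class if too few images remain, then cap at the maximum. `image` and `selectForClass` are hypothetical, treating the thresholds as inclusive is an assumption, and the sketch covers only the pre-download selection, not the post-download check.

```go
package main

import "fmt"

// image carries the metadata fields used for filtering.
type image struct {
	CatalogID string
	AvgRating float64
	Reviews   int
}

// selectForClass keeps images meeting the rating and review thresholds.
// It reports false if fewer than minSamples remain, and truncates the
// result to maxSamples otherwise.
func selectForClass(imgs []image, minRating float64, minReviews, minSamples, maxSamples int) ([]image, bool) {
	var kept []image
	for _, img := range imgs {
		if img.AvgRating >= minRating && img.Reviews >= minReviews {
			kept = append(kept, img)
		}
	}
	if len(kept) < minSamples {
		return nil, false // class is skipped
	}
	if len(kept) > maxSamples {
		kept = kept[:maxSamples]
	}
	return kept, true
}

func main() {
	imgs := []image{
		{"1", 4.5, 3},
		{"2", 3.9, 10}, // rating too low
		{"3", 4.0, 2},
		{"4", 5.0, 1}, // too few reviews
	}
	kept, ok := selectForClass(imgs, 4.0, 2, 2, 200)
	fmt.Println(len(kept), ok)
	_, ok = selectForClass(imgs, 4.0, 2, 3, 200)
	fmt.Println(ok)
}
```

With these sample values only two images pass the quality filter, so a minimum of 2 keeps the class while a minimum of 3 skips it, which is the situation the error message describes.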

"wg.Go undefined (type sync.WaitGroup has no field or method Go)"

  • Cause: Your Go toolchain predates sync.WaitGroup.Go, which was added in Go 1.25.
  • Fix: Upgrade your Go installation to 1.26 or newer, the version this project targets.

📝 License

Distributed under the MIT License. See LICENSE for more information.


Happy Birding! 🦉
