Competition for Particle Detection in 3D Cryo-Electron Tomography

Develop an ML model for annotating subcellular structures and proteins in CryoET data

Started

November 6, 2024

Ends

February 5, 2025

Prizes

$75k Total

About the Competition

We are holding a competition for the development of machine learning algorithms to overcome a major bottleneck limiting biomedical discoveries: the annotation and analysis of high-resolution 3D images from advanced imaging technologies.

Goal

Advance the understanding of cell biology through machine learning algorithms to annotate particles in 3D images of cells captured by cryoET.

These algorithms should be able to perform robust annotation of particles of variable shapes and sizes within the hundreds of 3D images in the competition dataset after being trained on a limited set of available reference annotations from the same dataset.

Impact

The algorithms developed through this competition will not only be considered for one of 10 cash prizes but will also be compatible with the standardized data already in the open access CryoET Data Portal. This means they can readily have a large impact on the field of cell biology and could even lay a foundation for future medicines.

Understanding detailed protein interactions in the context of the cell's internal architecture is not possible with conventional bioimaging technologies like light microscopy. CryoET can do this, but the field is currently held back by an annotation bottleneck. For example, there are over 15,732 tomograms publicly available in standardized format through the CryoET Data Portal alone. Only 5% of these have molecular annotations, which are necessary to derive insights about cell biology and find new targets for therapies to treat diseases.

By design, any tools developed for the competition can be applied to the entire data corpus on the CryoET Data Portal with relative ease. Competition algorithms can also help provide benchmark datasets and standardized pipelines to guide the continuous improvement of machine learning models.

The CryoET Data Portal is only a year old, but with a growing number of contributors, collaboration between the CryoET community and machine learning developers can uncover new insights about how all of the components of a cell come together—in different cells and during different states of health, disease and age. The field is due for a revolution: machine learning algorithms that can robustly annotate different particles within cells will unlock the discoveries currently trapped in thousands of existing tomograms.

Challenge Task

Contestants are tasked with developing ML algorithms that can robustly perform multi-class particle annotation on hundreds of unseen tomograms after being trained on a limited set of annotated tomograms from the same dataset as well as any other data — either synthetic or experimental — of the competitors' choice.

To encourage generalizability, we have chosen five target particles that have diverse shapes and collectively span nearly an order of magnitude in molecular weight:

  • Virus-like Particles or VLPs
  • Thyroglobulin
  • Beta-galactosidase
  • Apo-ferritin
  • 80S ribosome

The algorithm must be able to identify these target particles amidst non-target particles present in the samples, including nucleosomes, filaments, proteasomes, membrane-bound proteins, as well as structural features like membranes that frequently confuse picking algorithms.

Prizes and Judging Criteria

Ten prizes will be awarded for overall picking accuracy of all 5 particles:

Prize               Amount
1st Place           $15,000 USD
2nd Place           $12,000 USD
3rd Place           $10,000 USD
4th Place           $7,000 USD
5th Place           $6,000 USD
6th-10th Place      $5,000 USD each

Judging accuracy for these prizes will be relative to reference annotations that we generated through an elaborate and rigorous workflow. These “ground truth” labels will be used to score participants' results and will be released on the CryoET Data Portal as a resource to benchmark future algorithm development after the contest ends.
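
The exact scoring metric is defined on the Kaggle competition page and is not reproduced here. As an illustration only, the Python sketch below shows one common way to score particle picks against reference coordinates: each predicted point is greedily matched to the nearest unmatched reference point within a per-class distance threshold, and the pooled matches are summarized as an F-beta score. The function names, the greedy matching, the thresholds, the class names, and the choice of beta are all assumptions made for this example, not the competition's official method.

```python
import math


def match_picks(predicted, reference, max_dist):
    """Greedily match predicted points to reference points within max_dist.

    Returns (true_positives, false_positives, false_negatives).
    """
    unmatched = list(reference)
    tp = 0
    for p in predicted:
        # Find the closest reference point that has not been matched yet.
        best_i, best_d = None, max_dist
        for i, r in enumerate(unmatched):
            d = math.dist(p, r)
            if d <= best_d:
                best_i, best_d = i, d
        if best_i is not None:
            unmatched.pop(best_i)  # each reference point is matched at most once
            tp += 1
    fp = len(predicted) - tp
    fn = len(unmatched)
    return tp, fp, fn


def f_beta(tp, fp, fn, beta=1.0):
    """F-beta score from match counts; returns 0.0 when undefined."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def score_picks(predicted_by_class, reference_by_class, max_dist_by_class, beta=1.0):
    """Pool match counts over all particle classes and return one F-beta score."""
    tp = fp = fn = 0
    for name, ref in reference_by_class.items():
        pred = predicted_by_class.get(name, [])
        t, f, n = match_picks(pred, ref, max_dist_by_class[name])
        tp, fp, fn = tp + t, fp + f, fn + n
    return f_beta(tp, fp, fn, beta)


# Hypothetical coordinates and thresholds, in arbitrary units.
reference = {"ribosome": [(0, 0, 0), (100, 0, 0)], "apo-ferritin": [(50, 50, 50)]}
predicted = {"ribosome": [(2, 1, 0), (300, 0, 0)], "apo-ferritin": [(51, 50, 50)]}
thresholds = {"ribosome": 10.0, "apo-ferritin": 5.0}
print(score_picks(predicted, reference, thresholds))
```

Pooling counts across classes weights every particle equally; a scheme that weights classes differently, or that favors recall through a larger beta, would change only the last step.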

How to Participate

Location

The competition is hosted on Kaggle.

Timeline

Event                  Date
Competition Start      November 6, 2024
Entry Team Merger      January 29, 2025
Competition End        February 5, 2025
Awards Distributed     TBD
Outcomes Workshop      April 2025

Rules

An abbreviated list of the rules is below; the full list of rules is available on the competition's Kaggle page:

  • One account per participant: You cannot sign up to Kaggle from multiple accounts and therefore you cannot submit from multiple accounts.
  • No private sharing outside teams: Privately sharing code or data outside of teams is not permitted. It's okay to share code if it's made available to all participants on the forums.
  • Team Limits: The maximum team size is 5.
  • Winner license type: Open source. Specific licensing is MIT.
  • Eligibility: Competition is open to residents of the United States and worldwide except where excluded by Kaggle's terms or the competition eligibility rule.

Challenge Resources

To reduce the onboarding time for competitors, an extensive set of example notebooks covering several models is being provided.

These notebooks also leverage the copick library for handling cryoET datasets, for which we provide PyTorch Datasets and utility functions to simplify the creation of data loaders, metadata tracking, and model performance analysis.

These example notebooks can be found in the CZII ML Challenge Notebooks repository on GitHub.

In addition, the CryoET Data Portal provides multiple annotated datasets of protein complexes in situ which can be used to tune machine learning algorithms to the crowded nature of in situ samples.

Competition Data

Competition Deposition Name:

CZII - 2024 CryoET Object Identification Challenge

Experimental and simulated training data for the CryoET Object Identification Challenge. Each dataset contains tilt series, alignments, tomograms and ground truth annotations for six protein complexes (Apo-ferritin, Beta-amylase, Beta-galactosidase, cytosolic ribosome (80S), thyroglobulin and VLP). Curation procedures are described in detail in the accompanying paper. Details on how the dataset is used in the competition are available on Kaggle.

Want more data for training your model? Additional datasets are available on the CryoET Data Portal.

What is CryoET?

Overview

Cryo-electron tomography (CryoET) is an imaging technique that enables 3D visualization of the cell at sub-nanometer resolution. Unlike other high-resolution imaging techniques, it preserves cellular architecture through cryogenic (frozen) conditions, so this detailed view includes protein structures in their natural biological context. Three-dimensional tomograms are generated from many images of a thin slice through a cell, taken while tilting the specimen in multiple directions.

A given tomogram is typically only about 200 nanometers thick—approximately five hundred times thinner than a sheet of paper—yet packed with information about the structures of the cellular machinery driving health and disease.

For more information about CryoET basics, check out the educational articles from the CryoET Data Portal documentation site.

Current Barriers

Technological advancements have improved the efficiency of sample preparation, imaging, image processing, and open access sharing of standardized data. But annotation is necessary to turn these data into useful insights and, in most cases, is the rate-limiting step. Annotation involves tracing sparse particles of subcellular structures through many layers of low-contrast, two-dimensional images. It is required across many tomograms, for thousands of copies of a particle, in order to average to a high-resolution understanding of a protein complex. Annotation is still done manually for the most part; for a single set of tomograms this can take months, and for the total volume of data that already exists it is not feasible.
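
To see why thousands of copies of a particle are needed, consider a toy Python sketch of averaging. It assumes each copy is the same 1D signal corrupted by independent Gaussian noise, so the error of the average shrinks roughly with the square root of the number of copies. The signal, the noise level, and the function names are invented for illustration; real subtomogram averaging works on 3D volumes and must also align the particles, which this sketch ignores.

```python
import math
import random


def noisy_copy(signal, noise_sd, rng):
    """Return the signal with independent Gaussian noise added to each sample."""
    return [s + rng.gauss(0.0, noise_sd) for s in signal]


def average_copies(signal, n_copies, noise_sd, rng):
    """Average n_copies independently corrupted copies of the same signal."""
    total = [0.0] * len(signal)
    for _ in range(n_copies):
        for i, v in enumerate(noisy_copy(signal, noise_sd, rng)):
            total[i] += v
    return [t / n_copies for t in total]


def rms_error(estimate, signal):
    """Root-mean-square difference between an estimate and the true signal."""
    squared = [(e - s) ** 2 for e, s in zip(estimate, signal)]
    return math.sqrt(sum(squared) / len(squared))


rng = random.Random(0)  # fixed seed so the run is repeatable
signal = [0.0, 1.0, 0.0, -1.0] * 16  # toy 1D "particle" profile
for n in (1, 100, 1000):
    err = rms_error(average_copies(signal, n, noise_sd=5.0, rng=rng), signal)
    print(f"{n:>5} copies: RMS error {err:.3f}")
```

With noise several times larger than the signal, a single copy is unusable, while the average over many copies recovers the underlying profile. Every one of those copies has to be located in the tomograms first, which is the annotation step.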

Annotation of 3D tomograms is particularly challenging because of the:

  • low signal-to-noise ratio of the projections and volumes
  • structural complexity of cellular samples
  • diversity and heterogeneity of the molecules of interest
  • large numbers of molecules that are required for high resolution maps
  • variable number of particles per tomogram

Machine learning has been leveraged to provide membrane segmentations for over 15,000 of the 3D images currently available in the open access CryoET Data Portal. This cut down annotation time from several months to just over 3 days. Annotating particles such as protein complexes, however, is a far more difficult task due to their diversity, lower contrast, and crowding.

Further, current machine learning methods for CryoET are effective for the specific cases they were developed on but typically do not generalize sufficiently to meet the diverse needs of the CryoET community. Annotation strategies that can be readily adapted to datasets acquired under different settings and that better capture the heterogeneity of biomolecules would be transformative compared to currently available approaches.

Sample & Reference Annotations

As synthetic datasets for CryoET have not historically led to robust algorithms that can be applied to real data, we generated a “phantom” dataset for the purposes of this challenge. The term is used in the biomedical research community for objects that stand in for tissues to verify that systems and methods for MRI and CT imaging are operating correctly.

This phantom dataset mimics the cellular environment to some extent while providing a more controlled system that allowed us to provide, in a reasonable time frame, a relatively large annotated set of data containing widely varying structures. It has several key components designed to simulate some of the diversity and complexity of a cellular tomogram, including:

  • Cell lysate, which provides common elements such as ribosomes and membranes to offer a realistic backdrop of cellular material
  • Commercially available proteins:
    • Apo-ferritin
    • Thyroglobulin (THG)
    • Beta-galactosidase
    • Beta-amylase
    • Human Serum Albumin (HSA)
  • Virus-like particles (VLPs)

These components collectively provide a range of molecular weights and shapes for testing and developing machine learning algorithms for protein complex annotation. Our phantom sample captures many of the essential elements of CryoET data, including inherent background noise, a range of shapes and sizes of protein complexes, and the challenges associated with distinguishing similar-looking species.

The phantom sample, however, does not capture the crowdedness of real cellular environments. It nevertheless serves as an initial well-annotated dataset for developing, training, and advancing machine learning for CryoET data annotation, with the anticipation that this will lead to algorithms that will later be capable of labeling protein complexes in realistic cellular environments.

Generating “ground truth” reference annotations required multiple months of work by a large team of people and spurred the development of several new methods for particle picking, visualization, and curation. This effort included manual annotation, the use of template matching and machine learning algorithms, and the careful curation of the selected particles using a variety of approaches.

Below are the in-house tools that were developed as part of these efforts:

  • DenoisET
  • Copick
  • Slab-picking
  • Copick-ChimeraX
  • Copicklive
  • DeepFindET
  • CellCanvas
  • Minislab curation

We have already repurposed several of these tools for other projects at our institute, and we anticipate that these new methods will prove valuable beyond this challenge.

Learn more about the phantom sample and reference annotations in our publication.

Competition Contributors

Contributors are listed alphabetically by last name:

  • David Agard
  • Richa Agarwal
  • Ashley Anderson
  • Jeremy Asuncion
  • Rodrigo Baltazar
  • Tristan Bepler
  • Nina Borja
  • Alister Burt
  • Bridget Carragher
  • Mikala Caton
  • Anchi Cheng
  • Chi-Li Chiu
  • Yongbaek Cho
  • Ellaine Chou
  • Bryan Chu
  • Charlie Dubbledam
  • Kira Evans
  • Kirsty Ewing
  • Jessica Gadling
  • Lorenzo Gaifas
  • Kyle Harrington
  • Matthias Haury
  • Utz Heinrich Ermel
  • Norbert Hill
  • Erin Hoops
  • Timmy Huang
  • Peng Jin
  • Ann E Jones
  • Saugat Kandel
  • Kandarp Khandwala
  • Robert Kiewisz
  • Dari Kimanius
  • Mykhailo Kopylov
  • Justine Larsen
  • Manuel Leonetti
  • Donghui Li
  • Emma Lundberg
  • Kristen Maitland
  • Gorica Margulis
  • Dannielle McCarthy
  • Elizabeth Montabana
  • Ben Nelson
  • Jun Xi Ni
  • Stephani Otte
  • Mohammadreza Paraan
  • Noeli Pazsoldan
  • Ariana Peck
  • Clinton Potter
  • Janeece Pourroy
  • Dana Sadgat
  • Simon Sander
  • Jonathan Schwartz
  • Daniel Serwas
  • Shu-Hsien Sheu
  • Hannah Siems
  • Trent Smith
  • Andrew Sweet
  • Shivanshi Vaid
  • Madhuri Vangipuram
  • Manasa Venkatakrishnan
  • Carmela Villegas
  • Thorsten Wagner
  • Eric Wang
  • Zun Shi Wang
  • Feng Wang
  • Samantha Yammine
  • Yue Yu
  • Zhuowen Zhao
  • Shawn Zheng
  • Ellen Zhong

Sponsored By:

Chan Zuckerberg Imaging Institute
Chan Zuckerberg Imaging Program

Glossary of Terms

80S ribosome

  • Ribosomes are the molecular machines that translate the genetic information from the intermediary mRNA templates into proteins. The 80S ribosome is specifically the eukaryotic ribosome, and it is abundant in cell lysate since translation of mRNA is a constant activity within the cell.

Apo-ferritin

  • A 24-subunit globular protein found in all cells and tissues, where it binds and transports iron. Apo-ferritin refers to the iron-free form of the protein.

Annotating

  • The process of identifying proteins or membranes of interest in noisy 3D tomograms.
  • See also labeling and picking.

Contact

Have a question? Reach out to our team at cryoetdataportal@chanzuckerberg.com.