Frequently Asked Questions

Last Updated: April 25, 2024

We hope these answers will help you get the most out of the CryoET Data Portal! If you need additional information or assistance, you can reach us by submitting a GitHub issue. For help with submitting an issue, follow these instructions.

Did you encounter a bug, error, or other issue while using the Portal? Submit an issue on GitHub to let us know!

To submit an issue, you'll need to create a free GitHub account. This allows our team to follow up with you on GitHub if we have a question about the problem you encountered. Then, fill out this form. We suggest you use a descriptive title, paste any error messages using the <> icon on the form, and provide as many details as possible about the problem, including what you expected to happen and what type of machine you were using.

For more information about submitting issues on GitHub, please refer to GitHub's documentation.

The CryoET Data Portal uses the following data schema:

  1. A dataset is a community-contributed set of image files for tilt series, reconstructed tomograms, and, if available, cellular and/or subcellular annotation files. Every dataset contains only one sample type, prepared and imaged under the same conditions. The dataset title, such as S. pombe cryo-FIB lamellae acquired with defocus-only, summarizes these conditions. A sample can be a cell, tissue, or organism; an intact organelle; an in vitro mixture of macromolecules or their complexes; or in silico synthetic data, where the experimental conditions are kept constant. Downloading a dataset downloads all files, including all available tilt series, tomograms, and annotations.
  2. A run is one experiment, or replicate, associated with a dataset, where all runs in a dataset have the same sample and imaging conditions. Every run contains a collection of all tomography data and annotations related to imaging one physical location in a sample. It typically contains one tilt series and all associated data (e.g. movie frames, tilt series image stack, tomograms, annotations, and metadata), but in some cases, it may be a set of tilt series that form a mosaic. When downloading a run from a Portal page, you may choose to download the tomogram or all available annotations. To download all data associated with a run (i.e. all available movie frames, tilt series image stack, tomograms, annotations, and associated metadata), please refer to the API download guide.
  3. An annotation is a point or segmentation indicating the location of a macromolecular complex in the tomogram. On the run page, you may choose to download tomograms with their annotations.
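The hierarchy described above (a dataset holds runs, and a run holds annotations) can be pictured with a few plain Python dataclasses. This is only an illustrative sketch of the relationships as this FAQ describes them; the class and field names here are hypothetical and are not the Portal API's own classes, and the example values are placeholders.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Annotation:
    """A point set or segmentation locating an object in a tomogram."""
    object_name: str
    shape_type: str  # "point" or "segmentation" in this sketch


@dataclass
class Run:
    """One experiment (replicate): data from one physical location in a sample."""
    name: str
    annotations: List[Annotation] = field(default_factory=list)


@dataclass
class Dataset:
    """One sample type, prepared and imaged under the same conditions."""
    dataset_id: int
    title: str
    runs: List[Run] = field(default_factory=list)

    def all_annotations(self) -> List[Annotation]:
        # A dataset download includes every run's files, so flatten across runs.
        return [a for run in self.runs for a in run.annotations]


# Placeholder values, chosen only to show the nesting
dataset = Dataset(
    dataset_id=1,
    title="Example dataset title",
    runs=[
        Run("run_A", [Annotation("ribosome", "point")]),
        Run("run_B", [Annotation("ribosome", "point"),
                      Annotation("membrane", "segmentation")]),
    ],
)
print(len(dataset.runs), len(dataset.all_annotations()))  # prints: 2 3
```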

You can refer to a graphic of the data schema here.

The Data Portal's S3 bucket is public, so it can be accessed without creating an AWS account. Simply add --no-sign-request to your commands, as shown below. Using the instructions below, you can get started downloading data in only a few minutes. For more detailed instructions, please refer to the documentation here.

  1. Download the installer: MacOS Installer Download / Windows Installer Download
  2. Open the installer and complete installation following the prompts. (No further steps, since sign-in credentials ARE NOT needed to use the tool.)
  3. Open terminal (MacOS) or command prompt (Windows).
  4. Copy and paste the command from the download prompt for the desired data into terminal / command prompt and hit enter.
  5. Alternatively, create a custom command inserting the S3 URL of the data and the desired download destination in the spaces provided.

To download a single file, use cp:

aws s3 cp --no-sign-request [S3 bucket URL] [Local destination path]

To download multiple files, use sync:

aws s3 sync --no-sign-request [S3 bucket URL] [Local destination path]
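If you prefer to script these downloads, the two command templates above can be assembled in Python and handed to the AWS CLI. The sketch below assumes the AWS CLI is installed and on your PATH; the helper name build_download_command and the bucket URL are hypothetical placeholders, and the command is only printed here, not executed.

```python
import shlex
from typing import List


def build_download_command(s3_url: str, destination: str,
                           multiple: bool = False) -> List[str]:
    """Return the AWS CLI argument list for an anonymous download.

    `cp` copies a single file; `sync` copies everything under a prefix.
    """
    if not s3_url.startswith("s3://"):
        raise ValueError(f"expected an s3:// URL, got {s3_url!r}")
    subcommand = "sync" if multiple else "cp"
    return ["aws", "s3", subcommand, "--no-sign-request", s3_url, destination]


# Placeholder URL; substitute the S3 URL shown in the Portal's download prompt
cmd = build_download_command("s3://example-bucket/some/prefix/", "./data",
                             multiple=True)
print(shlex.join(cmd))
# To run it for real: subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems, but it also means a destination such as ~/Downloads/ is not expanded by a shell; wrap it in os.path.expanduser first if you use that form.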

For example, to download a particular JSON file of tomogram metadata into a folder called "Downloads" use:

aws s3 cp --no-sign-request s3://cryoet-data-portal-public/10000/TS_026/Tomograms/VoxelSpacing13.48/CanonicalTomogram/tomogram_metadata.json ~/Downloads/

In the above example, the download finished very quickly because the file was only about 1 kB in size. Typical tomograms, however, are multiple GB, so expect a single tomogram for a given run to take 30-60 minutes to download; larger downloads could take as long as days, depending on the number and sizes of the files. To speed up downloads, you can follow these instructions to optimize download speed.
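As a rough check on these timings, download time is simply file size divided by sustained bandwidth. The sketch below works through one case; the 5 GB file size and 20 Mbit/s connection speed are assumed figures chosen for illustration, not measurements from the Portal, and the estimate ignores protocol overhead.

```python
def estimate_download_minutes(size_gb: float, bandwidth_mbit_s: float) -> float:
    """Estimate transfer time in minutes, ignoring overhead and throttling."""
    if bandwidth_mbit_s <= 0:
        raise ValueError("bandwidth must be positive")
    size_megabits = size_gb * 1000 * 8  # 1 GB = 1000 MB, 1 byte = 8 bits
    seconds = size_megabits / bandwidth_mbit_s
    return seconds / 60


# Assumed example: a 5 GB tomogram over a 20 Mbit/s connection
minutes = estimate_download_minutes(size_gb=5, bandwidth_mbit_s=20)
print(f"{minutes:.0f} min")  # prints: 33 min
```

Under those assumptions the result falls at the low end of the 30-60 minute range quoted above; a slower connection or a larger file scales the estimate proportionally.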

All tomograms in the Data Portal are viewable in Neuroglancer along with their annotations. You can open a tomogram in Neuroglancer by clicking the blue View Tomogram button on any run page in the Portal. This will open an instance of Neuroglancer in a separate tab of your browser with the selected data along with their annotations already loaded. For more information about visualizing data with Neuroglancer, check out the documentation from Connectomics, the team that develops Neuroglancer, here.

The CryoET Data Portal napari plugin can be used to visualize tomograms, annotations, and metadata. Refer to this documentation to learn about how to use the plugin and to this page to learn more about napari and CryoET Data Portal.

  • The Dataset, Run, and TomogramVoxelSpacing classes have download_everything methods which allow you to download all data associated with one of those objects.

  • The Tomogram class has download_mrcfile and download_omezarr methods to download the tomogram as a MRC or OME-Zarr file, respectively.

  • The TiltSeries class has download_mrcfile and download_omezarr methods as well as download_alignment_file, download_angle_list, and download_collection_metadata to download the files associated with a tilt series.

All of the download methods default to downloading the data to your current working directory, unless a destination path is provided. The general structure of these commands is object.download_method(OPTIONAL DESTINATION PATH). For example, to download the TS_026 tomogram in OME-Zarr format to your current working directory use:

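from cryoet_data_portal import Client, Tomogram

# Instantiate a client, using the data portal GraphQL API by default
client = Client()

# Query the Tomogram class to find tomograms named TS_026
# (find yields every match rather than returning a single object)
tomos = Tomogram.find(client, query_filters=[Tomogram.name == "TS_026"])

# Download each matching tomogram in OME-Zarr format
for tomo in tomos:
    tomo.download_omezarr()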

For more examples of downloading data with the API, check out the tutorial here. The Data Portal API reference can be found here.

Every class in the Data Portal API has a find method which can be used to select all objects that match criteria provided in a query. Queries are built from the Python comparison operators ==, !=, >, >=, <, and <=, together with the like, ilike, and _in methods, which match a value against a string pattern or a list of accepted values.

  • like is a partial match, with the % character being a wildcard
  • ilike is similar to like but case-insensitive
  • _in accepts a list of values that are acceptable matches.
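To make the matching rules above concrete, the sketch below re-implements them locally on plain Python strings. It is not the Portal API's implementation (the real filters are evaluated by the Portal's query backend); the function names and run names are hypothetical, and only the % wildcard described above is modeled.

```python
import re
from typing import Iterable


def like(value: str, pattern: str, case_insensitive: bool = False) -> bool:
    """Match `value` against a pattern where % stands for any run of characters."""
    # Escape each literal chunk, then join the chunks with ".*" for every %.
    regex = ".*".join(re.escape(chunk) for chunk in pattern.split("%"))
    flags = re.DOTALL | (re.IGNORECASE if case_insensitive else 0)
    return re.fullmatch(regex, value, flags) is not None


def ilike(value: str, pattern: str) -> bool:
    """Same as like, but ignoring case."""
    return like(value, pattern, case_insensitive=True)


def in_(value, allowed: Iterable) -> bool:
    """True when `value` equals one of the allowed values."""
    return value in list(allowed)


# Hypothetical run names
names = ["TS_026", "ts_041", "Position_12"]
print([n for n in names if like(n, "%TS%")])    # ['TS_026']
print([n for n in names if ilike(n, "%TS%")])   # ['TS_026', 'ts_041']
print([n for n in names if in_(n, ["TS_026", "Position_12"])])  # ['TS_026', 'Position_12']
```

Note that a pattern without any % must match the whole string, which is why partial matches are usually written with a wildcard on both sides.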

The general structure of these commands is class.find(client, query_filters=[LIST QUERIES HERE]). For example, the script below will print the names of all runs that have "ts" in their name and more than 900 pixels in their "fast" axis.

from cryoet_data_portal import Client, Run

# Instantiate a client, using the data portal GraphQL API by default
client = Client()

# Query the Run class for runs with "TS" (case-insensitive) in their name and x pixels > 900
runs_list = Run.find(client, query_filters=[Run.name.ilike("%TS%"), Run.tomogram_voxel_spacings.tomograms.size_x > 900])

for run in runs_list:
    print(run.name)

For more examples of using the find operator, check out the tutorial here. The Data Portal API reference can be found here.

The tilt series quality score/rating is a relative, subjective scale meant for comparing tilt series within a dataset. The contributor of the dataset assigns a quality score to each tilt series to communicate their quality estimate to users. Below is an example scale based mainly on alignability and usefulness for the intended analysis.

Rating | Quality   | Description
5      | Excellent | Full Tilt Series/Reconstructions could be used in publication ready figures.
4      | Good      | Full Tilt Series/Reconstructions are useful for analysis (subtomogram averaging, segmentation).
3      | Medium    | Minor parts of the tilt series (projection images) need to be or have been discarded prior to reconstruction and analysis.
2      | Marginal  | Major parts of the tilt series (projection images) need to be or have been discarded prior to reconstruction and analysis. Useful for analysis only after heavy manual intervention.
1      | Low       | Not useful for analysis with current tools (not alignable), useful as a test case for problematic data only.
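One way to work with this scale in a script is to encode it as an enumeration and filter tilt series by a minimum rating. The sketch below is a local illustration only: the enum, the helper name, and the scores are hypothetical, and because the scale is relative, the comparison is meaningful only among tilt series from the same dataset.

```python
from enum import IntEnum
from typing import Dict, List


class TiltSeriesQuality(IntEnum):
    LOW = 1
    MARGINAL = 2
    MEDIUM = 3
    GOOD = 4
    EXCELLENT = 5


def usable_for_analysis(scores: Dict[str, int],
                        minimum: TiltSeriesQuality = TiltSeriesQuality.GOOD) -> List[str]:
    """Return names of tilt series rated at or above `minimum`, best first."""
    kept = [(name, score) for name, score in scores.items() if score >= minimum]
    return [name for name, _ in sorted(kept, key=lambda item: -item[1])]


# Hypothetical scores for tilt series within ONE dataset
scores = {"TS_a": 5, "TS_b": 2, "TS_c": 4, "TS_d": 3}
print(usable_for_analysis(scores))  # prints: ['TS_a', 'TS_c']
```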

The dataset identifier in the API refers to the Dataset ID provided in the Portal. This number is assigned by the Data Portal as a unique identifier for a dataset and is used as the directory name in the data filetree.

Descriptions of all terminology and metadata used in the Portal are provided here.

There is no definitive rule for which annotations are displayed with a tomogram in Neuroglancer by default. The annotations are manually chosen to display as many annotations as possible without overlap or occlusion. For example, when the cytoplasm is annotated as a whole, it would occlude other annotations included within, such as protein picks. When there is a ground truth and predicted annotation, the ground truth annotation is displayed by default. Authors contributing data can specify the desired default annotations during the submission process.

The CryoET Data Portal napari plugin can be used to visualize tomograms, annotations, and metadata. Refer to this documentation to learn about how to use the plugin and to this page to learn more about napari and CryoET Data Portal.

Thank you for considering submitting data to the Portal!

Contributions can be raw data (tilt series and movie frames) + resulting tomograms, a new tomogram for existing raw data in the Portal generated using a different algorithm, and/or annotations of existing tomograms. We encourage all contributions, including those which may be of lower quality than existing datasets on the Portal, as these datasets are useful for developing better annotation and data processing algorithms.

We will work with you to upload the data to the Portal. Please fill out this contribution form, which can also be reached through the Tell Us More button at the bottom of the Portal homepage. We will then reach out to you to start the process of uploading your data. We have a release cycle of roughly 6 months, so please allow time for the data to become available through the Portal.

In the future, we plan to implement a self-upload process so that users can add their data to the Portal on their own.