Selecting a Kids First Cancer Cohort
The Gabriella Miller Kids First Pediatric Data Portal (KF Portal) hosts datasets at the intersection of congenital birth defects and pediatric cancers, with genomic files for more than 16,000 participants.
Kids First Data Portal
Check out our lessons on Kids First to learn more about the different Data Portal features and building simple to complex queries.
Files on the KF Portal are managed through different access levels. Open access files (including processed files of somatic samples) can be viewed and downloaded by any user. Controlled access files (including raw sequencing files and imaging data) require approvals through dbGaP. For this tutorial, we will use open access pre-processed files generated using Kallisto (v0.43.1), which uses pseudoalignments to quantify transcript abundance from raw data.
KFDRC RNA-Seq Workflow
The Kids First RNA-Seq Workflow uses multiple tools/packages for expression detection and fusion calls. The workflow requires raw FASTQ files (controlled access) as input and generates multiple outputs including the Kallisto transcript quantification files. All the output files of this pipeline are available on the KF Portal as open access data. Due to access restrictions and computational intensity, this tutorial will not cover the Kallisto workflow, but users with their own RNA-Seq data may consider starting from this point.
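For readers who have not worked with Kallisto output before, the following is a minimal sketch of reading a transcript quantification table in Python. It assumes the standard Kallisto `abundance.tsv` column layout (`target_id`, `length`, `eff_length`, `est_counts`, `tpm`). The transcript names and values are made up for illustration, and the `read_abundance` helper is hypothetical, not part of the Kids First workflow.

```python
import csv
import io

# Toy stand-in for a Kallisto abundance table (values are illustrative only).
# Assumption: files follow the standard Kallisto abundance.tsv column layout.
SAMPLE = (
    "target_id\tlength\teff_length\test_counts\ttpm\n"
    "tx1\t1000\t850.0\t120.0\t35.5\n"
    "tx2\t500\t350.0\t0.0\t0.0\n"
    "tx3\t2000\t1850.0\t40.0\t5.4\n"
)


def read_abundance(handle):
    """Return {target_id: (est_counts, tpm)} from a tab-separated abundance table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        row["target_id"]: (float(row["est_counts"]), float(row["tpm"]))
        for row in reader
    }


abundance = read_abundance(io.StringIO(SAMPLE))

# Keep only transcripts with non-zero TPM.
expressed = sorted(t for t, (_, tpm) in abundance.items() if tpm > 0)
print(expressed)
```

To read a real file, pass an open file handle instead of the `io.StringIO` wrapper; a gzip-compressed file can be opened in text mode with `gzip.open(path, "rt")`.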
Step 1: Filter for open access data
- Log in to the KF Portal.
- Select the File Repository tab.
- Select the Browse All option for the Filter.
Data Summary
At the time this tutorial was written (January 2021), the portal contained a total of 88,728 files. Because new datasets are constantly uploaded to the KF Portal, the exact numbers returned by your query may differ slightly when you run it in the future.
- Select the Access filter listed under the FILE field.
- Select the Open value.
- Click View Results to update the selection. This results in 18,162 files.
Step 2: Apply File Filters to obtain RNA-Seq files
Select the File Filters tab and apply the following filters:
- Experimental Strategy: RNA-Seq
- Data Type: Gene expression
- File Format: tsv
This results in 1,477 files.
Step 3: Select cancer type
Switch to the Clinical Filters tab and apply the following filter:
- Diagnosis (Source Text): Medulloblastoma and Ependymoma
This reduces the number of files to 235.
Step 4: Subset cohort
To reduce possible sources of variation due to participant demographics, we will further narrow the query to only include data from white male patients.
Under the Clinical Filters tab, select:
- Gender: Male
- Race: White
This results in 99 files.
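The filters in Steps 1 to 4 are applied cumulatively, with each one narrowing the result of the previous. The sketch below illustrates that logic on a toy manifest in Python. It does not use the KF Portal itself: the column names, the file IDs, and the `apply_filters` helper are all hypothetical, so adapt them to whatever manifest you actually export.

```python
import csv
import io

# Toy manifest for illustration only; column names and file IDs are hypothetical.
MANIFEST = """file_id,access,experimental_strategy,data_type,file_format,diagnosis,gender,race
F1,Open,RNA-Seq,Gene expression,tsv,Medulloblastoma,Male,White
F2,Controlled,RNA-Seq,Gene expression,tsv,Medulloblastoma,Male,White
F3,Open,WGS,Gene expression,tsv,Ependymoma,Male,White
F4,Open,RNA-Seq,Gene expression,tsv,Ependymoma,Female,White
F5,Open,RNA-Seq,Gene expression,tsv,Ependymoma,Male,White
F6,Open,RNA-Seq,Gene expression,tsv,Neuroblastoma,Male,White
"""

# Same order as the tutorial: access, file filters, diagnosis, demographics.
FILTERS = [
    ("access", {"Open"}),
    ("experimental_strategy", {"RNA-Seq"}),
    ("data_type", {"Gene expression"}),
    ("file_format", {"tsv"}),
    ("diagnosis", {"Medulloblastoma", "Ependymoma"}),
    ("gender", {"Male"}),
    ("race", {"White"}),
]


def apply_filters(rows, filters):
    """Apply each (column, allowed values) filter in turn and record the remaining count."""
    counts = []
    for column, allowed in filters:
        rows = [r for r in rows if r[column] in allowed]
        counts.append((column, len(rows)))
    return rows, counts


rows = list(csv.DictReader(io.StringIO(MANIFEST)))
selected, counts = apply_filters(rows, FILTERS)
selected_ids = [r["file_id"] for r in selected]

for column, n in counts:
    print(f"after {column}: {n} files remain")
```

Recording the count after every filter mirrors the running totals reported in each step above, which makes it easier to spot the filter responsible when your numbers diverge from the tutorial's.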
Step 5: Copy files to Cavatica
Important
It is crucial to ensure the Cavatica integrations are enabled to allow for file transfers. Find more details in our Push to Cavatica lesson. You do not need to set up the Data Repository Integrations to continue with this lesson.
- Click on the ANALYZE IN CAVATICA button.
- Select the CREATE A PROJECT option and provide an appropriate name for your folder. This tutorial uses cancer-dge as the project name.
- Use the SAVE option to create the project.
Following project creation, the option will update to enable copying of the selected files to Cavatica.
Successful copying of the files to the project folder will result in a pop-up box summarizing the details along with a link to view the project folder on Cavatica. If the pop-up box disappears before you have a chance to click on the project link, you can log in to Cavatica and follow the steps to view files in Cavatica.
Query link
The KF Portal lets you share a query, with its unique combination of filters, as a short URL. Log in to your KF account and click on the query link to obtain the selected cohort.
In our next lesson, we will explore the newly created project folder and files on the Cavatica platform!
Media resources
A video walkthrough of the cancer cohort selection on Kids First Portal: