Upload Data to Cavatica and Edit Metadata

There are several ways to upload data from a local computer or an academic cluster to Cavatica. Cavatica provides many tutorials on how to do so, aimed at intermediate-level users of the interface.

This tutorial is a beginner-friendly guide to using Cavatica's Command Line Uploader to move fastq files from your AWS instance to Cavatica via the command line interface.

Learning Objectives

  • Learn how to upload files to Cavatica
  • Learn to edit metadata of files on Cavatica

Estimated time: ~ 30 min

Estimated cost: < $1.00

Prerequisites

  • AWS account

  • Cavatica account (check out all requirements in the Register for Cavatica page)

  • Basic command line

Visit the AWS tutorial webpage to launch a 64-bit Ubuntu Server 20.04 LTS (HVM), SSD Volume Type instance (t2.micro).

Warning

To avoid unnecessary charges, remember to terminate your AWS instance once you are done using it.

Step 1: Update Instance

The Ubuntu 20.04 LTS image is a snapshot of that release, so its packages may be out of date, and it is a good idea to update them and their dependencies to the latest versions. Before upgrading, refresh the package index so that apt knows which packages have updates available:

sudo apt update

To upgrade the installed packages, use the upgrade command:

sudo apt upgrade -y

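This lists the packages that will be upgraded. The -y flag answers "yes" automatically, so apt does not pause to ask for permission to continue. Alternatively, the two commands can be combined into one using &&, which runs the second command only if the first one succeeds: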

sudo apt update && sudo apt upgrade -y
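
To see how && behaves without touching any packages, here is a minimal sketch. The functions step_ok and step_fail are hypothetical stand-ins for a command that succeeds and one that fails; they are not part of apt.

```shell
# Hypothetical stand-ins for a succeeding and a failing command.
step_ok()   { echo "step ran"; return 0; }
step_fail() { echo "step failed" >&2; return 1; }

chain_demo() {
    # The second command runs because the first exits with status 0.
    step_ok && echo "upgrade would run"
    # The second command is skipped because the first exits non-zero.
    step_fail 2>/dev/null && echo "upgrade would run"
    # $? holds the status of the last command that actually ran (step_fail).
    echo "status after failed chain: $?"
}

chain_demo
```

In the same way, if sudo apt update fails (for example, because the network is unreachable), the upgrade is not attempted.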

Step 2: Download example data

Next, make a directory called "fastq" using the command mkdir, and then download some example fastq data that you can move from AWS to Cavatica:

mkdir fastq
cd fastq
curl -L https://osf.io/5daup/download -o ERR458493.fastq.gz
curl -L https://osf.io/8rvh5/download -o ERR458494.fastq.gz

curl is an open-source tool for transferring data. The -L flag tells curl to follow redirects if the server reports that the requested page has moved. The -o (or --output) flag saves the data to the named local file. These example files are from the Schurch et al. (2016) yeast RNA-Seq study. The exact nature of the files does not matter for this tutorial; any file type may be used in place of fastq.
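
Before moving on, it can be worth confirming that the downloads are intact. The following is a minimal sketch, not part of the original tutorial: the function name check_fastq_gz is hypothetical. It uses gzip -t to test each archive and prints the first fastq record (four lines) of every file that passes.

```shell
# Check that each file given as an argument is a valid gzip archive,
# then print its first FASTQ record (four lines).
check_fastq_gz() {
    for f in "$@"; do
        if gzip -t "$f" 2>/dev/null; then
            echo "OK: $f"
            gzip -dc "$f" | head -n 4
        else
            echo "CORRUPT or missing: $f"
            return 1
        fi
    done
}

# Usage, from inside the fastq directory:
# check_fastq_gz ERR458493.fastq.gz ERR458494.fastq.gz
```

If a file is reported as corrupt, downloading it again with the same curl command is usually the simplest fix.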

Step 3: Install Java and Download Command Line Uploader

The Cavatica Command Line Uploader needs java version "1.8.0_20". Ubuntu Server 20.04 LTS (HVM) does not come with java pre-installed. You will need to install it with this command:

cd
sudo apt install -y openjdk-8-jre-headless

The command cd, run with no arguments, takes you to your home directory. To check that Java was installed successfully, print its version:

java -version
openjdk version "1.8.0_275"
OpenJDK Runtime Environment (build 1.8.0_275-8u275-b01-0ubuntu1~20.04-b01)
OpenJDK 64-Bit Server VM (build 25.275-b01, mixed mode)
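
If you want to capture just the version number, for example in a setup script, something like the sketch below may help. The function name java_version_string is hypothetical. Note that java -version writes to standard error, so the usage line redirects it with 2>&1.

```shell
# Read `java -version` style text on standard input and print only the
# quoted version string from the first line that contains one.
java_version_string() {
    sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1
}

# Usage:
# java -version 2>&1 | java_version_string
```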

Next, download the Cavatica Uploader by running this code:

curl -LO https://cavatica.sbgenomics.com/downloads/cli/cavatica-uploader.tgz

The -O flag names the local file the same as its remote counterpart.

Now uncompress the Cavatica Uploader by running:

tar zxvf cavatica-uploader.tgz -C ~

The tar command bundles a collection of files into a single archive and, combined with gzip, compresses it; the same command unpacks such archives. It is widely used to create compressed files that are easy to move. Here the z flag unzips the file, x extracts files from the archive, v prints the filenames verbosely, and f means the following argument is a filename. By default, tar extracts the contents of the ".tgz" file into your working directory. You can override this behavior with the -C flag, which specifies the directory into which the contents should be extracted. In our case, we use ~ as shorthand for the home directory.
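
To try these flags safely, the sketch below builds a tiny throwaway archive in a temporary directory, lists its contents with t, and extracts it with -C. All file and directory names in it (demo-src, demo.tgz, demo-out, tool.sh) are made up for illustration and have nothing to do with the real uploader archive.

```shell
# Work in a temporary directory so nothing in your home directory changes.
workdir=$(mktemp -d)
mkdir -p "$workdir/demo-src/bin" "$workdir/demo-out"
echo "placeholder" > "$workdir/demo-src/bin/tool.sh"

# c = create, z = gzip-compress, f = archive file name;
# -C changes into $workdir before adding demo-src to the archive.
tar czf "$workdir/demo.tgz" -C "$workdir" demo-src

# t = list the contents without extracting anything.
tar tzf "$workdir/demo.tgz"

# x = extract; -C sends the files to demo-out instead of the current directory.
tar xzf "$workdir/demo.tgz" -C "$workdir/demo-out"
ls "$workdir/demo-out/demo-src/bin"
```

The same listing trick works on the real archive: tar tzf cavatica-uploader.tgz shows what will be unpacked before you extract it.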

Step 4: Test the Command Line Uploader

Check if the Uploader works by running this code:

~/cavatica-uploader/bin/cavatica-uploader.sh -h
ubuntu@ip-172-31-26-145:~$ ~/cavatica-uploader/bin/cavatica-uploader.sh -h
Upload files to Cavatica
usage: cavatica-uploader.sh [-h] [-l] [-p id] [-t token] [-x url] file ...
-a,--automation              Start automation from manifest file.
                             This option must be used together with
                             --manifest-file.
--dry-run                    Dry run the upload (manifest) and/or
                             metadata setting process.
-f,--folder <arg>            Specify optional folder, inside of a
                             specified project, to upload the files
                             into.
                             You can specify nested folder structure
                             separated by the path separator `/`.
                             If any of the specified folders is
                             missing it will be created.
-h,--help                    Print a short usage summary.
-l,--list-projects           Print a list of projects available as
                             upload targets. The output is a
                             tab-separated list of project IDs and
                             names.
--list-tags                  Print a list of tags present in a project
                             and exit.
                             This option must be used together with
                             --project.
-mf,--manifest-file <arg>    Specify manifest tabular file to set
                             metadata.
                             This option must be used together with
                             --project.
-mm,--manifest-metadata <arg>   Parse metadata from manifest file.
                             You can list metadata field names as
                             argument to this option.
                             This option must be used together with
                             --manifest-file.
-p,--project <arg>           Specify the ID of the project to upload
                             files to.
-pf,--preserve-folders       Should the folder structure for specified
                             input folders be preserved while
                             uploading recursively.
                             By default, files encountered in the
                             nested folders are `flattened`, and
                             uploaded into root target folder.
                             With this flag, inner folders are created
                             along the way, and files are uploaded
                             into them.
--skip-partial               Do not attempt to resume incomplete
                             uploads.
                             If omitted, the uploader will resume an
                             upload when the local file matches in
                             name and size.
-t,--token <arg>             Specify an authorization token.
--tag <arg>                  Apply tag <arg> to all the files in this
                             upload.
                             This option may appear multiple times.
-u,--username <arg>          Specify username.
                             If omitted and not using the -t option,
                             user will be prompted for a username.
-x,--proxy <arg>             Specify a proxy server through which the
                             uploader should connect.
                             The URL to the proxy server in the form
                             proto://[user:pass]@host[:port].
                             Proto can be `http' or `socks'. Supports
                             SOCKS4 and SOCKS5.
The program outputs a tab-separated list of newly created remote file IDs
and local file names.
Complete documentation is available at:
http://docs.sevenbridges.com/docs/upload-via-the-command-line

You can add the program to the instance's PATH variable to avoid using the full path for execution.

export PATH=$PATH:~/cavatica-uploader/bin/
echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/ubuntu/cavatica-uploader/bin/

Now the program can be called from any directory on your instance using the name cavatica-uploader.sh. You can also create an alias for the program to shorten the name. Running the alias with -h should print the same usage summary as above.

alias uploader=cavatica-uploader.sh
uploader -h

PATH

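Adding the program to the PATH variable this way lasts only for the current session on the AWS instance. The same is true of the alias. After you log out and back in, you will need to set both again.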
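
One common way to make such settings survive a new login is to add them to ~/.bashrc, which interactive bash shells read on startup. The sketch below is one possible approach, not part of the original tutorial: the helper append_once is hypothetical, and it adds a line to a file only if that exact line is not already there, so running it twice does no harm.

```shell
# Append a line to a file only if that exact line is not already present.
append_once() {
    line="$1"
    file="$2"
    touch "$file"
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Usage (commented out so that nothing is changed by accident):
# append_once 'export PATH="$PATH:$HOME/cavatica-uploader/bin"' ~/.bashrc
# append_once 'alias uploader=cavatica-uploader.sh' ~/.bashrc
# source ~/.bashrc
```

The single quotes keep $PATH and $HOME unexpanded, so they are evaluated each time a new shell starts rather than once when the line is written.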

Step 5: Move Files

Step 5a: Find your Cavatica Authentication Token and Username

The Authentication Token is a 32-character personalized code that allows other programs to access your Cavatica account. You can find it in your Cavatica account under the "Developer" tab.

Copy the Authentication token. You will replace a??????????????????????????????? in the code block below with your own token.

Next, find and remember your username visible at the top right corner of your Cavatica account page. Replace username in the code block below with your personal username.
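
Because the token grants access to your account, you may prefer to keep it in a shell variable instead of typing it into each command. The sketch below assumes a variable named CAVATICA_TOKEN, which is a made-up name and not something the uploader looks for on its own. The helper token_looks_valid is also hypothetical; it only checks the 32-character length described above, not whether Cavatica will accept the token.

```shell
# Return success only if the argument is exactly 32 characters long.
token_looks_valid() {
    [ "${#1}" -eq 32 ]
}

# Usage: read the token without echoing it to the screen, then check it.
# (-s and -p are options of the bash `read` builtin.)
# read -r -s -p "Token: " CAVATICA_TOKEN
# export CAVATICA_TOKEN
# token_looks_valid "$CAVATICA_TOKEN" && echo "token length OK"
```

With the variable set, "$CAVATICA_TOKEN" can stand in for the literal token after the -t flag.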

Step 5b: Choose a Cavatica Project

You can either create a new project or choose an existing project.

New project

Create a new Cavatica project by clicking on the "Projects" tab on the Cavatica homepage and selecting the " + Create a project" option. You can name your new project whatever you like. Use your project name to replace "project-name" in the code block below.

Existing project

Alternatively, you may choose to select an existing project. To get a list of all existing projects to choose from, run this code:

uploader -t a??????????????????????????????? --list-projects

The -t flag passes your Authentication token to the uploader. Remember to replace a??????????????????????????????? with your own Authentication token.

Important

If your project name contains underscores (_), replace them with hyphens (-) in the uploader command.
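
The substitution can be scripted. This is a minimal sketch with a hypothetical helper, project_id, that applies only the underscore rule described above and joins the result to a username. Project IDs may differ from project names in other ways too, so treat the output of --list-projects as the authoritative value.

```shell
# Build "username/project-id" from a username and a project name,
# replacing every underscore in the project name with a hyphen.
project_id() {
    printf '%s/%s\n' "$1" "$(printf '%s' "$2" | tr '_' '-')"
}

project_id username sim_fastq   # prints: username/sim-fastq
```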

Step 5c: Moving Files

Finally, you can transfer files by running the following code. Remember to replace "project-name" with the name of your project and "username" with your Cavatica login name.

cd ~/fastq
uploader -t a??????????????????????????????? -p username/project-name *.fastq.gz
Initializing upload...
Starting upload of 2 item(s) to 'project-name'
Uploading file '/home/ubuntu/fastq/ERR458493.fastq.gz'
5fadb1cee4b05495de67d7ea    /home/ubuntu/fastq/ERR458493.fastq.gz    100%
Uploading file '/home/ubuntu/fastq/ERR458494.fastq.gz'
5fadb1dce4b05495de67d7f1    /home/ubuntu/fastq/ERR458494.fastq.gz   45.52%

By running this code, you are copying files from your AWS instance to the "project-name" project within Cavatica. The * wildcard matches all the files in the current directory whose names end in ".fastq.gz". The -p flag tells the uploader which Cavatica project to put the files into. If you wish, you could use the -f flag to upload into a subfolder inside the project, which is created if it does not exist (not shown here).
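
The help text shown earlier lists a --dry-run option. One way to use it is as a rehearsal before the real transfer, as in the sketch below. The wrapper dry_run_upload and the variables UPLOADER and CAVATICA_TOKEN are hypothetical names introduced here. The wrapper first checks that the wildcard matched at least one real file, then passes the files to the uploader with --dry-run added. What the uploader prints during a dry run is not shown in this tutorial, so no output is given.

```shell
# Check that the file arguments exist, then run the uploader with --dry-run.
# UPLOADER defaults to cavatica-uploader.sh; CAVATICA_TOKEN must hold your token.
dry_run_upload() {
    project="$1"
    shift
    # An unmatched wildcard is passed through literally, so test the first name.
    if [ "$#" -eq 0 ] || [ ! -e "$1" ]; then
        echo "no matching files" >&2
        return 1
    fi
    echo "Files selected: $#"
    "${UPLOADER:-cavatica-uploader.sh}" -t "$CAVATICA_TOKEN" -p "$project" --dry-run "$@"
}

# Usage, from inside ~/fastq:
# dry_run_upload username/project-name *.fastq.gz
```

Once the rehearsal looks right, the plain uploader command without --dry-run performs the actual upload.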

You're all done! Log in to Cavatica and look for your files in the "Files" tab in your project.

Step 6: Edit Metadata

Important

The example project used in this section of the tutorial is called "sim_fastq". Your project name will be different. To practice, we recommend following along with the yeast fastq files ("ERR458493.fastq.gz" and "ERR458494.fastq.gz") that you previously uploaded to the "project-name" project.

  • First, click into your desired Project and then select the "Files" tab.

  • You should see all your files listed here.

  • Next, select the files whose metadata you wish to edit by checking the box next to the file name.

  • Click the "Edit Metadata" button.

  • You should see a pop-up window on the right side of the screen.

  • Fill in or edit all the metadata terms you wish to use for your analysis and click "Save".

  • Your new metadata terms should now be displayed on your screen!

Don't see your metadata column of interest?

  • Click on the table icon on the right hand side of the page.
  • Check all the column names you wish to add to the metadata display.

An alternative method to edit metadata terms can be found in the Cavatica documentation page under the tab: Modify metadata using the visual interface. Detailed instructions coming soon.


Last update: August 9, 2021