Before starting

In this workshop we’ll use Google Cloud to analyze raw metagenomic sequence data and identify the microbial composition of stool from healthy humans compared to Crohn’s disease patients. In addition to generating this microbial census, you’ll also assemble sequences into contigs, which can then be used to infer functional potential. To accomplish these tasks we’ll use Sunbeam, a Snakemake-based metagenomics pipeline developed by Kyle Bittinger and his group at the PennCHOP Microbiome Center.

The material below walks you through this workshop and provides a general web-based lesson plan for how one might conduct such a workshop.

To participate in this workshop, you’ll only need a few things:

  • a laptop computer
  • an internet connection
  • a Google account (free)
  • a Google Cloud account (free sign-up comes with a $300 credit!)

Set up cloud computer

We’ll begin the workshop with a demonstration of how to launch your first Google Cloud instance. Build your cloud computer with the following specs:

  • 8 cores
  • 50 GB of RAM
  • 100 GB solid-state hard drive
  • Ubuntu Linux 18.04 LTS operating system

Once you have finalized this instance, you have effectively rented a computer from Google, and everyone in the workshop is using exactly the same type of computer with the same operating system and compute resources. For the computer we set up above, you will be charged 36 cents per hour, or about $260/month. The more powerful the computer, the more you will be charged in rent, regardless of whether you actually use these resources. If you fail to delete your instance after the workshop, your credit card will be charged after about a month, once the $300 credit has been spent.
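The cost figures above follow from simple arithmetic, which can be sketched in the shell. The hourly rate and the credit are the values quoted above; the 30-day month is an assumption made here for illustration.

```shell
# Hourly rate and free credit quoted above; a 30-day month is assumed.
rate_cents_per_hour=36
credit_cents=$((300 * 100))

# Cost of leaving the instance running for a full 30-day month.
monthly_cents=$((rate_cents_per_hour * 24 * 30))

# Whole days of continuous running before the credit is used up.
days_of_credit=$((credit_cents / rate_cents_per_hour / 24))

printf 'Monthly cost: $%d.%02d\n' $((monthly_cents / 100)) $((monthly_cents % 100))
printf 'Credit lasts about %d days\n' "$days_of_credit"
```

Working in whole cents keeps the calculation within the shell's integer arithmetic. An instance left running continuously uses up the credit in a little over a month.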

Install Sunbeam

  • Connect to your cloud computer using the ‘ssh’ button next to the instance.

  • Download Sunbeam from GitHub using the code below

cd ~
git clone -b stable sunbeam-stable
  • Install Sunbeam
cd sunbeam-stable
  • In order to start using Sunbeam, we need to close our ssh terminal and reopen it.

  • Since Sunbeam was installed as a Conda environment, we have to enter this environment to start using the software

source activate sunbeam

This is a command you’ll want to remember for future sessions. Each time you log in to your cloud instance, you’ll need to activate the pipeline with source activate sunbeam. Upon activation, your command prompt should begin with “(sunbeam)”. Any time you want to exit the Sunbeam environment, simply type source deactivate sunbeam and hit return.
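If you are unsure whether the environment is active, one way to check is to inspect the CONDA_DEFAULT_ENV variable, which conda sets to the name of the active environment. The helper name in_sunbeam_env below is hypothetical and is not part of Sunbeam.

```shell
# Hypothetical helper: succeeds only when the active conda environment
# is named "sunbeam" (conda exports CONDA_DEFAULT_ENV on activation).
in_sunbeam_env() {
    [ "${CONDA_DEFAULT_ENV:-}" = "sunbeam" ]
}

if in_sunbeam_env; then
    echo "sunbeam environment is active"
else
    echo "sunbeam environment is not active; run: source activate sunbeam"
fi
```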

Get data

  • Let’s install some additional software in our environment. SRA tools will allow us to easily retrieve raw data from NCBI’s Sequence Read Archive.
conda install -c bioconda sra-tools
  • For this workshop, we’ll use data from a recent metagenomics study in Crohn’s disease. This was a large study, but for the purpose of the workshop we’ll only fetch data from 7 patients. Note: contaminating human reads have already been removed from these files. Let’s download these data to our cloud computer using the fasterq-dump command from SRA tools.
cd ~
mkdir workshop-data
cd workshop-data
fasterq-dump SRR2145310
fasterq-dump SRR2145329
fasterq-dump SRR2145381
fasterq-dump SRR2145353
fasterq-dump SRR2145354
fasterq-dump SRR2145492
fasterq-dump SRR2145498
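Because large downloads occasionally fail partway, it can be worth confirming that every accession produced its FASTQ files before moving on. The sketch below assumes the runs are paired-end, so that fasterq-dump writes <accession>_1.fastq and <accession>_2.fastq; the function name missing_fastq is hypothetical.

```shell
# Hypothetical helper: print the path of every expected FASTQ file that
# is missing or empty. Assumes paired-end output named
# <accession>_1.fastq and <accession>_2.fastq.
missing_fastq() {
    dir=$1
    shift
    for acc in "$@"; do
        for mate in 1 2; do
            file="$dir/${acc}_${mate}.fastq"
            # -s is true only for files that exist and are non-empty.
            [ -s "$file" ] || echo "$file"
        done
    done
}

# Check the seven accessions downloaded above.
missing_fastq "$HOME/workshop-data" \
    SRR2145310 SRR2145329 SRR2145381 SRR2145353 \
    SRR2145354 SRR2145492 SRR2145498
```

If the command prints nothing, all expected files are present. Any path it prints points to an accession worth downloading again.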

Initialize project

cd ~
mkdir workshop-project
sunbeam init workshop-project --data_fp workshop-data
  • Use the nano text editor to explore the samples file and configuration file.

Download reference data

  • We need two reference databases to run our analysis: a database of host DNA sequence to remove, and a database of bacterial DNA to match against.

  • We’ll get the human genome data from UCSC. Filtering against the entire human genome takes too long, so we’ll only filter against chromosome 1.

cd ~
mkdir human
cd human
gunzip chr1.fa.gz
  • Sunbeam requires that the host DNA sequence files end in “.fasta”, so it can find them automatically. Let’s use the mv command to rename this file.
mv chr1.fa chr1.fasta
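If you later filter against more than one sequence file, renaming each one by hand gets tedious. A minimal sketch of a loop that renames every .fa file in a directory, using a hypothetical helper named rename_fa_to_fasta:

```shell
# Hypothetical helper: rename every *.fa file in a directory to *.fasta.
rename_fa_to_fasta() {
    dir=$1
    for f in "$dir"/*.fa; do
        # When nothing matches, the glob stays literal; skip that case.
        [ -e "$f" ] || continue
        # ${f%.fa} strips the trailing ".fa" before ".fasta" is appended.
        mv "$f" "${f%.fa}.fasta"
    done
}

rename_fa_to_fasta "$HOME/human"
```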
  • The database of bacterial genomes comes pre-built from the homepage of our taxonomic assignment software, Kraken. We’ll download that using the wget program.
cd ~
tar xvzf minikraken_20171101_4GB_dustmasked.tgz
  • Now that we have reference databases, we need to add them to our configuration file. We’ll use nano to open and modify this file directly in our terminal.
cd ~
nano workshop-project/sunbeam_config.yml
  • The configuration values are below. You’ll need to navigate to the right spot in your configuration file, and substitute the directory name “kylebittinger” with your Google username.
host_fp: "/home/kylebittinger/human"
kraken_db_fp: "/home/kylebittinger/minikraken_20171101_4GB_dustmasked"
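Rather than guessing your username, you can let the shell print the two values using your own home directory. This sketch assumes the human and minikraken directories sit directly under your home directory, as in the steps above; the function name print_config_values is hypothetical.

```shell
# Hypothetical helper: print the two configuration values for a given
# home directory. Assumes both reference directories live directly
# under that directory.
print_config_values() {
    home_dir=$1
    printf 'host_fp: "%s/human"\n' "$home_dir"
    printf 'kraken_db_fp: "%s/minikraken_20171101_4GB_dustmasked"\n' "$home_dir"
}

print_config_values "$HOME"
```

The printed values can then be copied into the matching entries of sunbeam_config.yml.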

Run the pipeline

  • We are ready to actually run the pipeline. All the information about how to run the pipeline is in our configuration file, so we’ll provide that to Sunbeam (--configfile argument). We’ll also let Sunbeam know how many CPU cores we’d like to use (--jobs argument).
cd ~
sunbeam run --configfile workshop-project/sunbeam_config.yml --jobs 7
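The value 7 leaves one of the instance's 8 cores free. That reading is an assumption about the choice above, not something Sunbeam requires. Under that assumption, the job count can be derived from the machine itself with nproc (part of GNU coreutils); the helper name jobs_for_cores is hypothetical.

```shell
# Hypothetical helper: use all cores but one, and never fewer than one job.
jobs_for_cores() {
    cores=$1
    if [ "$cores" -gt 1 ]; then
        echo $((cores - 1))
    else
        echo 1
    fi
}

# nproc reports the number of processing units available.
jobs=$(jobs_for_cores "$(nproc)")

# Print the resulting command instead of running it, so it can be reviewed first.
echo "sunbeam run --configfile workshop-project/sunbeam_config.yml --jobs $jobs"
```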

Generate a report

To look at some of our results, we’ll install a Sunbeam extension and generate a report.

cd ~/sunbeam-stable/extensions
git clone
conda install --file sbx_report/requirements.txt

This is going to install the R programming language on our remote computer, which will take a bit of time. When the installation is complete, generate the report:

cd ~
sunbeam run --configfile workshop-project/sunbeam_config.yml final_report