Disclaimer: This is my first blog post. That’s right, I’m a complete rookie at this. Why now? Well, since starting my faculty position four years ago and helping to build and direct the Center for Host-Microbial Interactions, I increasingly find myself involved in collaborations and projects where I’m learning something really interesting — something worth sharing with my lab or with the broader scientific community — but that doesn’t easily translate into a traditional publication. Basically, I’m learning some cool stuff, it’s not always evident in my publications, and so here we are.
Case in point: I recently attended Microbiome Analysis in the Cloud, a two-day workshop held at the Institute for Genome Sciences at the University of Maryland, Baltimore. The workshop had a lot of high points, including excellent planning and preparation on the part of the organizers, a highly skilled staff that worked the room to help troubleshoot, and a program that covered a lot of ground. That last point was an explicit goal of the workshop, but it also meant I left feeling I couldn’t yet work through a full dataset on my own. Now that I’ve had a chance to review the workshop materials, I feel a bit more comfortable and want to put down my workflow in this blog post. Expect updates in the coming months as I marinate on this.
I also want to acknowledge Joshua Orvis, a bioinformatician at IGS and one of the workshop instructors. Without his one-on-one help and his development of Chiron, this tutorial wouldn’t be possible.
Why use the cloud?
I’m not here to sell you on the idea of cloud computing. In fact, maybe you should ignore all the buzz about ‘the cloud’. After all, you accrue charges as your cloud instance runs, so forgetting to shut down an instance could result in some hefty charges. Some people complain that it’s a nuisance to move all your data to the cloud to begin working, only to then have to pull your results off the cloud before you shut it down. But the same data gymnastics come into play when you use any remote computing resource. Similarly, folks will often cite the problem that all programs and dependencies needed to carry out your work have to be installed by you before your cloud instance is useful. The arrival of Docker has largely made this a non-issue, in my opinion. Yes, I’m aware that Docker is viewed with trepidation by some in the bioinformatics community (see here, here, and here), but it seems to me that Docker and cloud computing make good bedfellows.
Still reluctant to dive in? Well, an alternative to using cloud computing resources is to simply invest in your own compute cluster or leverage compute resources at your institution. I have to say, both of these alternatives have some non-trivial downsides as well. Running your own in-house compute cluster gives you tons of control, but with great power comes great responsibility: you’ll have to maintain it yourself, which requires sysadmin skills. Acquiring that fancy new hardware will also take a serious chunk of change (think $10K or more), and it quickly becomes outdated. My university has a pretty awesome compute cluster, but if your favorite program isn’t available, you likely won’t have the access privileges to install it yourself, and it can take some time for the powers-that-be to get it installed. We also experience frequent interruptions and server downtime on our university cluster. Here’s the bottom line: just like any other resource, cloud computing has its pros and cons, and it should be thought of not as the only solution to your problems, but rather as one tool in your bioinformatics toolbox. So, let’s get started!
Fire up your cloud computer
The two most popular cloud computing services are Amazon Web Services (AWS) and the Google Cloud Platform. Amazon, although the better known of the two, feels cumbersome to me – the first hour of the workshop and 36 slides were devoted just to getting our AWS instance up and running. I prefer Google. If you’re still undecided, I’d also point out that Google gives you a $300 credit, good for one year from the time you activate your cloud account! This is more than enough cash to work your way through this tutorial and still have plenty left for some of your own analyses.
I put together the video tutorial below to walk you through the following steps:
- setting up your Google Cloud compute instance
- installing Docker on this instance
- installing Chiron for quick and easy access to a bunch of dockerized programs for metagenomics
- installing the Google Cloud SDK software on your own computer (not the cloud) so you can easily connect to your new cloud resources
- connecting an FTP client to the instance so you can easily transfer files back and forth
- tearing it all down when you’re done
Below the video you’ll find all the commands to work through these steps on your own.
Install your programs
Once your gcloud instance is running, click on the ‘ssh’ button next to the instance to open a terminal window. This fast and easy way to connect to your cloud instance is one nice feature of the way gcloud is set up. We’ll now use this ssh connection to install Docker and Chiron.
Install some dependencies that we’ll need for Chiron
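The exact commands depend on the image your instance is running, but on an Ubuntu-based gcloud instance the setup looks roughly like this (the package names and the Chiron repository URL below are my best recollection, so double-check them against the workshop materials):

```shell
# Refresh the package index and install git (needed to fetch Chiron).
sudo apt-get update
sudo apt-get install -y git

# Install Docker from the Ubuntu repositories.
sudo apt-get install -y docker.io

# Let your user run docker without sudo (log out and back in afterwards).
sudo usermod -aG docker $USER

# Clone Chiron into your home directory.
git clone https://github.com/IGS/Chiron.git
```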
Any Docker image could be put on your instance at this point. Take a look here to see if your favorite bioinformatics program has been dockerized.
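If you want to confirm Docker itself is working before pulling anything large, a quick sanity check looks like this:

```shell
# Pull and run Docker's tiny test image; it prints a hello message and exits.
docker run hello-world

# Pulling any other image works the same way, e.g.:
# docker pull <repository>/<image>:<tag>
```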
Look around your working directory. In particular, take note of all the cool metagenomics tools that are now available in /Chiron/bin
Although the ssh terminal available right on your instance is very convenient, it does not establish a connection between our local computer and the cloud instance, which we need in order to move files back and forth. To do that, we’ll want to connect to our instance from the terminal app on our local computer. Go ahead and launch your Terminal app. Before we do anything else, let’s execute a command that will allow us to see hidden files in our directory; we need access to a few of these hidden files for the purposes of this tutorial.
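On a Mac (I’m assuming macOS here), you can make hidden dotfiles such as ~/.ssh visible in the Finder with:

```shell
# Tell Finder to display hidden files, then relaunch it so the change sticks.
defaults write com.apple.finder AppleShowAllFiles YES
killall Finder

# Run the same two commands with NO in place of YES to hide them again later.
```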
Now install the Google Cloud SDK.
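The quickest route is Google’s interactive installer, which downloads the SDK and offers to add it to your PATH:

```shell
# Download and run the Google Cloud SDK installer.
curl https://sdk.cloud.google.com | bash

# Restart your shell so the gcloud command is picked up.
exec -l $SHELL
```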
In my experience, if you encounter any issues with this tutorial, it will be with getting the SDK installed and connecting to your instance. For example, you may notice that the installation fails with the following error:
This error has to do with the IPv6 settings on your computer preventing you from connecting to a Google server to download the SDK command-line tools.
If you encounter this error, the fix is simple. Begin by temporarily turning off IPv6 support for either Wi-Fi or Ethernet, depending on which one you are using to connect to the internet. If you’re using a Wi-Fi connection, you would turn it off with:
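On macOS, the networksetup utility handles this; the service name must match how you connect:

```shell
# Temporarily disable IPv6 on the Wi-Fi interface.
networksetup -setv6off Wi-Fi

# If you're on a wired connection, use the Ethernet service instead:
# networksetup -setv6off Ethernet
```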
Now reattempt the installation as you did above.
Once you have the Google Cloud SDK installed, be sure to turn IPv6 back on.
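Again with networksetup, restore the default automatic IPv6 configuration:

```shell
# Re-enable automatic IPv6 configuration on Wi-Fi.
networksetup -setv6automatic Wi-Fi

# Or, for a wired connection:
# networksetup -setv6automatic Ethernet
```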
Connect to your instance
Now we’ll connect to the instance from within our Terminal.
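The basic command looks like this; substitute your own instance name and zone (the names below are just placeholders):

```shell
# Open an ssh session to the instance through the SDK.
gcloud compute ssh my-instance --zone us-east1-b
```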
If the above command fails with an authentication error, it’s because this is the first time you’ve run the SDK and it isn’t sure you should have access to your Google account from the terminal. Take a moment to authenticate.
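Authentication opens a browser window where you grant the SDK access to your Google account (the project and zone values below are placeholders):

```shell
# Authorize the SDK to act on behalf of your Google account.
gcloud auth login

# Optionally set a default project and zone so future commands are shorter.
gcloud config set project my-project
gcloud config set compute/zone us-east1-b
```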
Launch an interactive session with Chiron
Chiron gives you access to QIIME for processing marker-gene sequence data, as well as the bioBakery suite of tools from Curtis Huttenhower’s lab for handling shotgun metagenomic sequencing data. One of the first steps in the bioBakery workflow is using MetaPhlAn2 to get species- and strain-level composition information from raw sequence files. This is a logical place for us to start as well.
Launch the MetaPhlAn2 interactive session
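Chiron wraps each tool in a launcher script under Chiron/bin; the exact script name below is from memory, so list the directory if it doesn’t match:

```shell
# Start an interactive shell inside the MetaPhlAn2 Docker container.
~/Chiron/bin/metaphlan2_interactive

# Not sure of the script name? See what's available:
# ls ~/Chiron/bin
```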
Set permissions so you can transfer data
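I did this with a blunt chmod on a working directory shared between the container and the instance (the directory name is just an example, and a more restrictive mode would also work):

```shell
# Make the shared working directory world-readable/writable so both the
# container and your FTP client can move files in and out of it.
sudo chmod -R 777 ~/working_dir
```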
Rinse and repeat
Go through the steps below to make sure you have mastered this tutorial:
close your terminal, reopen it, and make sure you can reconnect to your instance
relaunch the interactive metaphlan session
Go back to your Google Cloud account, select the instance you’ve been working with (the checkbox to the left of the instance), and choose ‘Delete’ from the menu bar.
Make sure you can repeat the whole set-up again, except for the installation of the SDK – that only needs to be done once.
Once you’re comfortable with the whole process, you’re ready to move on to part II of this tutorial!