I always tell people that my favorite aspect of UNIX is the fact that you can quickly and easily string together a few command-line utilities and do things that would require days worth of custom programming to pull off on Windows. There are plenty of UNIX geeks using OS X these days, but I’d wager that the majority of scientific Mac users are still a bit shy when it comes to the terminal and are perhaps a bit glad that it is safely tucked away in the /Applications/Utilities folder. I thought I’d write up a practical example of how you can use the Terminal application and a few handy command-line utilities to make your life easier. In this example I’m going to illustrate how you can automate the retrieval of protein sequence data from NCBI’s genbank database. Note that the principles illustrated in this simple example can easily be applied to numerous other situations.
If you’d like to try this example for yourself you’ll need to launch the terminal application which is found in the /Applications/Utilities folder. In this example we begin with a list of accession numbers from a published study that are given to us in a text file. In NCBI speak, accession numbers are simply the unique identifier for a particular entry in the database. In this case the accession numbers are references to protein sequence data entries. We obviously can’t do much science with a list of accession numbers so we’ll have to retrieve the sequence data and place it into a properly formatted data file. If you lacked programming skills this normally means that you would use a web browser to manually search for each accession number, export the data to the desired format, and paste it into a data file. You can imagine how tedious this gets after the first few accession numbers. Mundane data collection tasks such as this are usually ripe for automation and in this case we can write a very simple command-line statement involving three simple utilities to automate the whole process.
Before we begin let’s look at the contents of the text file containing the accession numbers:
17986270
4503385
118214
76673410
73951844
55666937
51458701
4507041
6753680
30794286
6978777
32483397
118206
as you can see it’s a simple file containing one accession number per line. The next and most difficult part of this task is finding out how we can get data from the NCBI website without manually using a web browser. To do this I used a web browser to manually search the NCBI website using the first accession number and pull up the FASTA formatted sequence data for the accession number I entered. Then I took note of the URL and determined that the list_uids
parameter expected an accession number, and all I have to do to request FASTA formatted sequence data for any accession number is to make a request with list_uids=my_accession_number
. Through some further testing I determined that some of the other parameters in the URL were not required for a successful request so I ended up with the following:
http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&list_uids=my_accession_number&dopt=fasta&sendto=t
Now I have all the information I need to assemble the command-line and automate sequence retrieval. The web interaction is going to be performed by the curl
utility which comes standard with OS X. One of the other standard utilities we are going to be using is xargs
. The xargs
utility is able to accept lines of text and use them as parameters to launch other utilities. It might sound simple, but it can be quite powerful if used wisely. Before we begin lets understand fully what we are trying to accomplish. For each accession number listed in the text file, we need to download fasta formatted sequence data and place it into a data file. We already know that we can use the URL above to request sequence data if we replace my_accession_number
with a valid accession number. We know that curl
is used to download data from the web, and that xargs
can accept text lines and pass them on to other commands. I hope you are getting a feeling that our problem can easily be accomplished with curl
and xargs
. We haven’t discussed how we connect these utilities and in order to do that we need to become familiar with the “|” or pipe as it is known. When the pipe character, |
, is placed between two command line statements it tells the terminal environment, or “shell”, to take the output from the first command and use it as the input for the second command. That explanation is a bit simplified but it is all we need to know for the purpose of this example. So the first thing we are going to do is use the cat
utility which, when called with only a file name as an argument, will simply print out each line of the file one after the other. If you run the command yourself you should see the following:
PowerBook-G4-15:~/src/curl_fun$ cat accession_list.txt
17986270
4503385
118214
76673410
73951844
55666937
51458701
4507041
6753680
30794286
6978777
32483397
118206
Now that we can output each accession number in our file, we need to call curl
to request the sequence data for that accession from NCBI. Before we do that, however, we need to make sure that the value for list_uids
in the URL is set equal to each accession number as it is output. To do this we’ll use the xargs
utility to replace a variable character in the URL, in this case %
, with the accession number produced as output from the cat
utility. When we put it all together it looks like the following:
PowerBook-G4-15:~/src/curl_fun$ cat accession_list.txt | xargs -I % curl "http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&list_uids=%&dopt=fasta&sendto=t"
When this command-line is executed, the cat
utility will send each accession number to the xargs
command through the pipe mechanism. For each accession number that is received by xargs
utility the -I option tells xargs
to replace the %
character in the URL with the accession number that it recieves from the via the pipe and then execute the curl
command with the updated URL. Notice that the URL must be enclosed in double quotes otherwise the command will not execute properly. If you run the above command you’ll notice that the sequence data is printed to the terminal. To send this data to the file we need to add one more simple element. The “>
” mechanism tells the shell to take any output from the left side of the character and place it into a file with the file name given on the right side of the character. Let’s call our new file data_file.fasta
and the following is the updated command line:
PowerBook-G4-15:~/src/curl_fun joel$ cat accession_list.txt | xargs -I % curl "http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&list_uids=%&dopt=fasta&sendto=t" > data_file.fasta
When the above command is executed the sequence data will be written to a new file named data_file.fasta. if we wanted to append the sequence data to an already existing data_file.fasta
we would need to use “>>
” mechanism. Now we have a sequence data file that is ready to be used for analysis.
We’ve completed the automation example and as you can see there is not much to it at all. I hope this simple example gives you a taste of what’s available to you through the terminal and how it might help you to automate or augment your research activities.
Leave a Reply