Research with Lamendella Labs: February 2013

Feb 25, 2013

Qiime install disk

Goal: make a quick way to install Qiime.

Methods: the following files were burned to a single DVD.

Qiime 1.6.0 .vdi.gz (zipped Qiime image for the VM)
VirtualBox (VM software for both Mac OSX and Windows)
7-zip (archiving/compression software for Windows)
A README.txt describing the files.

The included README file is as follows:

The two VirtualBox files are installers for a Virtual Machine (VM) that will allow us to easily run Qiime. There is an installer for Windows and another for Mac OSX. (Using Linux? Talk to me.)



QIIME-1.6.0-amd64.vdi.gz is a compressed archive we will run using the VM. It contains all of the Qiime software libraries installed on the operating system Ubuntu 12.04 LTS. We will need to extract this archive and save to our computer before it can be used. It will take up to 30 GB of hard drive space when fully extracted.



7z920-x64.msi is a Windows installer for 7-zip. This program will allow us to extract the very large archive mentioned above. (It will also let us compress and extract other archives, which can save a lot of space.)





This was quickly created by Colin Brislawn on 2/24/2013 for the Lamendella Lab at Juniata College.

Result: The DVD was used by the class to install the Qiime VM. Perhaps a compression method with faster decompression (LZMA2? LZO?) could be used next time. Additional copies of the DVD would also make the process much faster.
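To compare decompression speed before burning the next disk, the extraction step could be timed on a test file. The following is only a sketch: the 1 MB file of zeros is a hypothetical stand-in for the real .vdi image, and the timing has one-second resolution, so it is only meaningful on large files.

```shell
# Work in a throwaway directory so nothing real is touched.
work=$(mktemp -d)

# Hypothetical stand-in for the real image: about 1 MB of zeros.
head -c 1000000 /dev/zero > "$work/demo.vdi"
gzip -c "$work/demo.vdi" > "$work/demo.vdi.gz"

# Time the extraction. -d decompresses, -c writes to stdout,
# so the .gz archive itself is left in place.
start=$(date +%s)
gzip -dc "$work/demo.vdi.gz" > "$work/restored.vdi"
end=$(date +%s)
echo "decompressed in $((end - start)) s"

# Confirm the extracted file matches the original byte for byte.
cmp -s "$work/demo.vdi" "$work/restored.vdi" && echo "round trip OK"
```

The same wrapper should work with `xz -dc` (LZMA2) or `lzop -dc` (LZO) in place of `gzip -dc`, assuming those tools are installed.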

Feb 17, 2013

Normalizing reads

Can Qiime analyze paired end reads?
Some parts of the workflow can handle it, but we will analyze reads from a single end because that workflow has been established and used before.

What are the sequences of the 515F and 806R primers?
In our sequences, will we see these primers or their reverse complements?

In 16S, are the regions of high variability closer to the 515F or the 806R primer?
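
Once the primer sequences are known, checking for reverse complements is mechanical. Here is a minimal sketch, assuming the `rev` utility is available (it is on most Linux and Mac systems). The input sequence is a made-up example, not a real primer. IUPAC ambiguity codes are swapped as well, since primers often contain them.

```shell
# Reverse complement a DNA sequence given as the first argument.
# Ambiguity codes are paired as R<->Y, K<->M, B<->V, D<->H;
# S, W and N are their own complements and pass through unchanged.
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTRYKMBDHVacgtrykmbdhv' 'TGCAYRMKVHDBtgcayrmkvhdb'
}

# Hypothetical example sequence (not a real primer).
revcomp "ACGTTM"   # prints KAACGT
```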

Method to normalize reads:
Our data has already been quality filtered, so we do not have to worry about quality. We will use the following process because it relies on existing Qiime scripts (and Erin's expert knowledge).
Use convert_fastaqual_fastq.py to get .fna and .qual files from the .fastq files we have.

convert_fastaqual_fastq.py -f file_you_want_converted.fastq -o output_directory -c fastq_to_fastaqual

To run the files in bulk, I added all the file names to a file called list.txt and ran the following bash script overnight:

while read -r line
do time convert_fastaqual_fastq.py -f "$line" -o R1_fasta -c fastq_to_fastaqual
done < list.txt

I tried running this from an internal hard drive, but it was not faster at all.

I ran truncate_fasta_qual_files.py on the resulting .fna and .qual files using the following script.

while read -r line; do echo -e "\n\n running $line"; time truncate_fasta_qual_files.py -f "$line.fna" -q "$line.qual" -b 150 -o fasta_filtered/; done < files.txt

This runs the truncation script on every file listed in files.txt, which looks like this:

ALXM_S15_L001_R1_001

BCWL_S16_L001_R1_001

CaRM_S17_L001_R1_001

CroRM_S18_L001_R1_001

...
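
A list like this does not have to be typed by hand. A minimal sketch, assuming each .fna file has a matching .qual file with the same base name in the same directory. The demo directory and the empty files in it are hypothetical stand-ins for the real data.

```shell
# Hypothetical demo directory with empty stand-in files.
demo=$(mktemp -d)
touch "$demo/ALXM_S15_L001_R1_001.fna" "$demo/ALXM_S15_L001_R1_001.qual"
touch "$demo/BCWL_S16_L001_R1_001.fna" "$demo/BCWL_S16_L001_R1_001.qual"

# Write one base name per line, with the directory and .fna extension
# stripped, which is the form the truncation loop reads from files.txt.
for f in "$demo"/*.fna; do
    basename "$f" .fna
done > "$demo/files.txt"

cat "$demo/files.txt"
```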

All files are now the right length for the next stage in the pipeline!
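
A quick way to double-check the lengths is to report the longest sequence in a file, which should be no more than 150 after truncation. This is a sketch rather than part of the Qiime workflow: `max_seq_len` is a helper name made up here, and the small FASTA file is hypothetical test data.

```shell
# Print the length of the longest sequence in a FASTA file.
# Handles sequences that are wrapped across several lines.
max_seq_len() {
    awk '
        /^>/ { if (len > max) max = len; len = 0; next }  # header line: close out the previous record
             { len += length($0) }                        # sequence line: add to the current record
        END  { if (len > max) max = len; print max + 0 }  # close out the final record
    ' "$1"
}

# Hypothetical test data: one 4-base record and one 6-base record split over two lines.
demo_fna=$(mktemp)
printf '%s\n' '>a' 'ACGT' '>b' 'ACG' 'TTT' > "$demo_fna"

max_seq_len "$demo_fna"   # prints 6
```

Running `max_seq_len` on each .fna file in fasta_filtered/ and looking for any value above 150 would flag a file that was missed.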

Feb 13, 2013

A Quick and Dirty search for Bottlenecks

2 hours of work over 3 hours

Objective: find and remove bottlenecks from a system running qiime
Method: check resource use in the running system. Bottlenecks occur when one resource is the 'limiting reagent' in the speed of a program.

Understand the system
In this case the top part of the running stack was basically:
- Qiime script written in python
- running in the command line in Ubuntu 12.04 64-bit
- in an Oracle VM
- on a server running Windows Server 2008 R2E 64-bit
There are other components of the stack that are not listed here, but these are the ones most accessible to us.
Look where you can
We can't easily look at everything (CPU cycles, Python). What we can't see, we can't optimize. We are only going to worry about what we can see and change. From the list above, I chose to look at resource use in Ubuntu and Windows because both OSs have good, built-in resource monitors.

System Monitor in Ubuntu
Ubuntu is towards the top of the stack, so I'll look at it first.
1. Click on Dash Home in the top left corner of the screen.
2. Search for 'monitor' and open it. (You can also right-click its launcher icon and choose 'Lock to Launcher'.)
3. It shows use of CPU, Memory (and swap), and network. You can also see number of CPU cores and RAM and all running processes.
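
The same numbers can be read from the Ubuntu terminal, which is handy when a script is already running there. A minimal sketch, assuming a Linux guest with `nproc` and the usual /proc files. The variable names are just for illustration.

```shell
# Number of CPU cores the operating system can see.
cores=$(nproc)

# Total RAM in MB (MemTotal is reported in kB).
mem_mb=$(awk '/^MemTotal:/ { print int($2 / 1024) }' /proc/meminfo)

# One-minute load average: roughly how many processes want CPU time right now.
load=$(cut -d ' ' -f 1 /proc/loadavg)

echo "CPU cores: $cores"
echo "RAM (MB): $mem_mb"
echo "1-min load average: $load"
```

As a rough rule of thumb, a load average that stays at or above the core count suggests the CPU is saturated.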
Task Manager in Windows
This offers a quick comparison of what's going on in Windows and Ubuntu.
1. Right click on the taskbar. Click 'Start Task Manager'.
2. This shows CPU, Memory, network, and processes. You can also see number of CPU cores and RAM and all running applications, processes, and services.
Resource Monitor in Windows (as an Admin)
This is like Task Manager, but a lot more detailed.
1. Click Start and search for 'resource'.
2. CPU use is divided into processes and services and threads of each.
  RAM is divided by process.
  Disk reads and writes are shown by process. If Disk Access is the bottleneck, this is the quickest way to find it.
Identify resources that are usually fully used...
... and then see if you can increase that resource.

Conclusion: CPU appeared to be the bottleneck. Ubuntu was running on a single core as the VM was assigned only 1 core.

Actions Taken: I stopped the script (Ctrl+C in the terminal) and shut down the VM. In the VM's settings, I gave the VM a total of 15 cores and 16 GB of RAM. This improved speed tremendously (~1 order of magnitude). In retrospect, I should have given it more RAM, as more was available and the 16 GB was quickly used up. However, the CPUs were also maxed out, so RAM may not have been the primary bottleneck.

Feb 7, 2013

Cleaning

I'm scheduled to clean the lab this week and I have never done this before. I dropped by before dinner and Erin showed me the basics.
I tidied up the lab, loaded up one box of pipette tips, and readied some prepared pipette tips for autoclaving.
With the help of another Aaron, I autoclaved several boxes of tips and returned them to the lab.

Feb 2, 2013

Text Parsing

2.5 hours of work
Objective: Parse sequence files for Steven so they are in the correct format for further analysis.
Method: Got the sequence files from Steven. I chose the Qiime VM (1.6.0 on Ubuntu) as a platform for editing. I looked into using Python 2.7.3 to read and write files, and into text parsing and string matching in Python. I installed Python and its IDE on the VM.
I decided that this was taking way, way too long and that a different approach was needed. I have experience using simple regular expressions, so I tried using them.
On Windows 7, download and run Notepad++ from a .zip archive.
Using the built-in find-and-replace function, use the following regexp:
Find: ' /.+' (Type it without the quotes. Note the leading space.)
Replace: Nothing! (This deletes the matched text.)
This matches the first space that is followed by a forward slash, plus everything after it on the same line.
I ran this on 4 files, each about 200 MB of text. It took about 5 minutes total. Each file is about 160 MB after editing.
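The same edit can be done from the command line with sed, which avoids opening a 200 MB file in an editor. This is a sketch of an equivalent, not what was actually run. `strip_after_slash` is a name made up here, and the sample lines are hypothetical, since the real file format is not shown.

```shell
# Delete the first " /" on each line together with everything after it.
# -E enables extended regular expressions so that .+ means "one or more characters".
# The s command uses | as its delimiter so the / in the pattern needs no escaping.
strip_after_slash() {
    sed -E 's| /.+$||' "$1"
}

# Hypothetical sample input.
demo_txt=$(mktemp)
printf '%s\n' '>read1 /1 extra text' 'ACGT' '>read2 /2' 'TTGA' > "$demo_txt"

strip_after_slash "$demo_txt"
```

Lines with no space-slash pair pass through unchanged. Redirecting the output to a new file keeps the original intact.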
Results: Returned parsed files to Steven.
Moral: Don't reinvent the wheel. Instead of attempting to learn new things, I should have tried to solve the problem with tools I am already skilled with and comfortable using. I did not need a fancy script, just a good text editor. Also, regular expressions always come in handy.