Goal: given two lists of sequences, find the sequences that appear in both lists.
Methods: primer prospector filtered a list of potential primers down to the ones least likely to form secondary structure. Based on this list of 217 'good' primers, we needed to make a list of which wells those primers are in. So how do we compare these two lists?
In Excel 2013, the list of 'good' primers was appended to the column containing all primers.
This column was selected, then:
Home > Conditional Formatting > Highlight Cells Rules > Duplicate Values
All columns with data were selected, then:
Data > Filter
In the column with barcodes, the drop-down box in the top cell was clicked, then:
Filter by Color > Cell Color > (choose the highlight color)
Results: This produces a list of barcodes identified as 'good' by primer prospector, along with their plate number and position for easy access.
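The same intersection can also be checked on the command line (a sketch with placeholder file names, not part of the original workflow):
# keep plate-map rows whose primer sequence appears in the good-primer list
grep -F -f good_primers.txt primer_plate_map.tsv > good_primer_wells.tsv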
See also: Find duplicate values fast by applying conditional formatting
1 hour
Oct 14, 2013
Denoising on the Cluster
Goals: Denoise 454 data on the HHMI cluster.
Methods: The following script was submitted:
denoise_wrapper.py -v -i GPWS.sff.txt,GSTY.sff.txt,GSAN.sff.txt,GSAI.sff.txt \
-f combined_seqs_100213.fna -m soil_master_metadata_072513.txt \
-o combined_denoising_output/ -n 96 --titanium
Results: the database was built successfully and the filtering step runs FlowgramAli_4fr 96 times. However, all 96 of these threads run on one node, instead of three.
New Goals: fully use three nodes while denoising
Methods: cancel denoising, AFTER I confirm with the Qiime developers that I am correctly using the script to resume denoising.
To resume denoising, I should be able to run:
mpirun denoiser.py -v -i GPWS.sff.txt,GSTY.sff.txt,GSAN.sff.txt,GSAI.sff.txt \
-f combined_seqs_100213.fna -m soil_master_metadata_072513.txt \
-o combined_denoising_output_resumed/ -p combined_denoising_output/ --checkpoint_fp combined_denoising_output/checkpoints/checkpoint50.pickle \
-c -n 96 --titanium
Kyle suggested mpirun, to balance these threads between all nodes.
10 hours over 3 days
Oct 11, 2013
Additional sample collection and DNA extraction
Goals: take additional samples from Otto's and from commercial vendors.
Methods: our group had lunch at Otto's while Chris took five additional samples from completed beer. Afterwards, we purchased two of these beers in bottles from a local store. Back in lab, up to 120 ml of each sample was filtered through Sterivex (Millipore) filters. This makes 7 samples today and 11 samples overall.
Goals: extract DNA
Methods: the Mo Bio kit was used to extract DNA from the filters after the filters were sliced into thin strips. A Qubit® 2.0 Fluorometer was used to measure DNA concentrations in the extracted samples.
Results: Of the 11 samples, only 3 produced detectable quantities of DNA (>5 ng/ml).
Next: get all possible DNA from existing samples
15 hours over two days
Sep 21, 2013
VM image, qiime 1.7.0, and wash bottles
Miscellaneous day in lab.
The VM image of qiime 1.6.0 which is provided by the Knight Lab was fully updated along the lines of my previous guide. This should reduce the configuration time needed for other students in lab.
After the cluster was backed up, I deployed qiime 1.7.0 to /share/apps/qiime-1.7.0/. After some of the manual fixes (like manual addition of pplacer), I made a new module called qiime-1.7.0 based on the new activate.sh. Running
module load qiime-1.7.0
then print_qiime_config.py -t
confirms that the newer qiime is functioning on the cluster. We need to make sure everything is stable by restarting our cluster, then we should have all the qiimes we need.
I refilled the wash bottles used for sterilization. We primarily use 70% denatured ethanol and 10% bleach solutions (both are percents by volume).
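(A quick worked example of the percent-by-volume mixing, assuming the stock bleach is treated as the 100% starting point: 1 L of the 10% bleach solution is 100 mL bleach brought up to 1 L with water.)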
3-4 hours
Sep 12, 2013
we have an OTU table!
Goal: get an OTU table using denoised data
Methods:
On our cluster, with qiime 1.6.0:
pick_otus.py -i combined_denoised_seqs.fna -z -r /share/apps/qiime_software/gg_otus-12_10-release/rep_set/97_otus.fasta -m uclust_ref --uclust_otu_id_prefix qiime_otu -o uclust_ref_gg12_
then
pick_rep_set.py -i uclust_ref_gg12_/combined_denoised_seqs_otus.txt -f combined_denoised_seqs.fna -r /share/apps/qiime_software/gg_otus-12_10-release/rep_set/97_otus.fasta -o pick_rep_set
On EC2 running qiime 1.7.0:
parallel_assign_taxonomy_rdp.py -i /home/ubuntu/data/soil/pick_rep_set.fasta -O 8 --rdp_max_memory 4000 -o /home/ubuntu/data/soil/tax_assign_out2
Back on our cluster with qiime 1.6.0:
make_otu_table.py -i combined_denoised_seqs_otus.txt -t pick_rep_set_tax_assignments.txt -o soil_otu_table.biom
Result: We have an OTU table called soil_otu_table.biom! More info about it:
Num samples: 61
Num otus: 12528
Num observations (sequences): 646884.0
Table density (fraction of non-zero values): 0.1284
Seqs/sample summary:
Min: 3279.0
Max: 33718.0
Median: 9823.0
Mean: 10604.6557377
Std. dev.: 5310.3842468
Median Absolute Deviation: 3709.0
Default even sampling depth in core_qiime_analyses.py (just a suggestion): 3279.0
Sample Metadata Categories: None provided
Observation Metadata Categories: taxonomy
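This summary looks like the output of the Qiime script per_library_stats.py; if so, it can be regenerated from the table at any time (a sketch, assuming that script was the source):
per_library_stats.py -i soil_otu_table.biom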
Sep 10, 2013
assign_taxonomy on EC2
Goal: using qiime 1.7.0 on EC2 to assign taxonomy to soil OTUs.
Methods: This script was used by Ryan for the fracking project, and we used it again.
The file run_soil.sh:
#!/bin/bash
nohup echo "start time: $(date)"
nohup time \
parallel_assign_taxonomy_rdp.py \
-i /home/ubuntu/data/soil/pick_rep_set.fasta \
-O 8 \
--rdp_max_memory 4000 \
-o /home/ubuntu/data/soil/tax_assign_out/
nohup echo "end time: $(date)"
We then ran
./run_soil.sh &
to use this script.
1.5 hours
Sep 8, 2013
denoising done! OTU picking on our Cluster
Goals: pick OTUs on the HHMI Cluster
Denoising Methods:
454 sequences were denoised using the following script, which was called rerun.sh.
rm out/ -Rf
rm nohup.out
echo "Start time: $(data)"
denoise_wrapper.py -v -i GSTY.sff.txt \
-f GSTY_s20_seqs.fna \
-m GSTY_mapping.txt \
-o out/ -n 8 --titanium
echo "End time: $(data)"
This script was run with
nohup ./rerun.sh &
On our Cluster
We removed completed files from the EC2 instances, used cat to combine the denoised sequences (.fna files) into combined_seqs.fna, and uploaded this file to our cluster.
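For reference, the combining step was along these lines (a sketch; the per-run denoised file names are placeholders, not the actual names used):
cat GPWS_denoised_seqs.fasta GSTY_denoised_seqs.fasta GSAN_denoised_seqs.fasta GSAI_denoised_seqs.fasta > combined_seqs.fna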
The file 97_otus.fasta from GreenGenes gg_13_5 was used as the reference for picking OTUs with the following script:
pick_otus.py -i combined_denoised_seqs.fna -z -r /share/apps/qiime_software/gg_otus-12_10-release/rep_set/97_otus.fasta -m uclust_ref --uclust_otu_id_prefix qiime_otu -o uclust_ref_gg12_
We then ran
pick_rep_set.py -i uclust_ref_gg12_/combined_denoised_seqs_otus.txt -f combined_denoised_seqs.fna -r /share/apps/qiime_software/gg_otus-12_10-release/rep_set/97_otus.fasta -o pick_rep_set
Then we ran
parallel_assign_taxonomy_rdp.py -i pick_rep_set.fasta -o rdp_assigned_taxonomy/ -O 32
Results: OTUs were picked very quickly (15 minutes). A total of 12528 OTUs were found, 8638 of which were new.
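Those counts can be double-checked from the OTU map (a sketch; it assumes each line of the map is one OTU and that new OTUs carry the qiime_otu prefix set with --uclust_otu_id_prefix above):
wc -l < uclust_ref_gg12_/combined_denoised_seqs_otus.txt   # total OTUs
grep -c '^qiime_otu' uclust_ref_gg12_/combined_denoised_seqs_otus.txt   # new (non-reference) OTUs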
Picking the rep set was also very fast.
Assigning taxonomy with RDP hangs on qiime 1.6.0. This is a known issue, which has been fixed in 1.7.0. We could get qiime 1.7.0 running on our cluster or use an EC2 instance.
4 hours with great happiness and sadness.
Sep 6, 2013
Informatic tools in our lab
Remote Machines
On your computer, search for and then launch Remote Desktop. Enter the server name or IP address. Some of our servers and their IPs are listed below:
- Basement lab PC: 10.39.4.1
- VLCS 1062 PC:
- GCAT-SEEK server: gcatseek
Your Virtual Machine running QIIME
Get one of the installation DVDs from Dr. L and follow the instructions in the readme file. You can also follow the official documentation. Before starting the VM, check the resource load on the system and adjust your settings accordingly. You are now technically ready to use Qiime, but I recommend these additional adjustments.
The HHMI Cluster
Send an email to Dr. Lamendella or Colin Brislawn, or speak with us in lab. If you already have access, you can use ssh to connect. The IP address is 10.39.6.10.
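For example (a sketch; substitute your own user name):
ssh your_username@10.39.6.10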
Setting up the Qiime VM
Goal: Get Qiime running in VirtualBox and fine-tune its settings.
Methods: The Qiime pipeline is distributed as a Virtual Machine (VM). This way, the complex pipeline 'just works.' At least that's the hope. These are the steps I took while setting up Qiime 1.7.0.
Installing the Qiime Virtual Box is very well documented. I followed those steps. Because I'm on Windows, I used 7-zip to open the compressed .gz file. Everything else is the same.
After following these instructions, I opened the Virtual Machine (VM) to make sure it was working. I also did the following things.
- Installed the VirtualBox Guest Additions using the disk icon on the left side of the screen.
- Installed the Synaptic Package Manager using the Ubuntu Software Center.*
- Installed the package 'nautilus-open-terminal'*
- Installed these packages: ipython, ipython-notebook-common, and ipython-notebook
- Did updates (all of them, I think... In Qiime 1.6.0, some updates caused problems. I don't remember any problems in Qiime 1.7.0)*
- Changed some Settings: (first, I shut down the VM)
- Gave my VM access to more memory and processor cores. (in settings>system)
- Made Shared Clipboard and Drag'n'Drop bidirectional (settings>general>advanced)*
- Connected the Shared_Folder on my virtual desktop, to a folder in my real computer. (settings>Shared Folders>Add Shared Folder)
- Pinned Synaptic Package Manager, System Monitor, and Files to my Dashboard.*
- Opened Terminal and ran the script 'print_qiime_config.py' (And it worked!)
4 hours
Sep 2, 2013
Get files from a finished EC2 Instance
Objective: Download our files from an EC2 instance on which denoising has finished.
Method: log into AWS and go to your instances.
Start the instance: Right-click > Start. (For long runs, we usually set an alarm to shut down the instance when CPU use drops to zero, so we have to start it back up again to download our data.)
Remove any alarms which may Stop your instance. In the column titled 'Alarm Status,' click on the link then the name of the alarm, then make sure 'Take this action' is unchecked.
Connect to the instance with ssh: Right-click > Connect (You can also use Terminal on Mac or PuTTY on Windows.) Change the user name to 'ubuntu' and select the path to the .pem certificate.
Remount Volumes. (If you need to, you can check which volumes are attached.) Run
sudo mount /dev/xvdf $HOME/data/
Then check that the data folder contains the files you need.
Connect to the instance with Cyberduck.
Download ALL the files!
You may consider compressing your files to save download time. In the directory, pick a file and type
gzip YourFileName.fna
You can compress an entire folder with
tar -czvf YourFolderName.tar.gz YourFolderName/
Compression is particularly good with large repetitive files, so it's perfect for sequence data.
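For completeness, the matching commands to expand these archives later (a sketch, using the example names above):
gunzip YourFileName.fna.gz
tar -xzvf YourFolderName.tar.gz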
Results: The denoiser.log, centroids.fasta, singletons.fasta, denoiser_mapping.txt, denoised_clusters.txt, and denoised_seqs.fasta were downloaded from the two finished Instances.
3 hours over 5 hours
Aug 30, 2013
using primer prospector
Objective: select good primers for sequencing a section of the mer operon.
Methods: We continued to use the script
check_primer_barcode_dimers.py
to test ~1000 barcodes combined with a pair of primers. Abby has the list of primers tested.
The overall speed of our testing was greatly increased by using files on the Desktop of the VM running on an internal HDD. This was much faster than using files from a VM stored on an external USB 2.0 drive, or files accessed through the Shared_Folder.
Results: Every primer-barcode combination was eliminated because all pairs produced secondary structures. This seems unrealistic, leading us to believe that we are using the script wrong. This could also happen if there is a conflict between our primers and the Illumina adapter.
3 hours
Aug 27, 2013
pprospector-1.0.1 works!
Objective: get primer prospector to work.
Methods: continuing on from yesterday using the Qiime 1.7.0 VM. This includes pprospector-1.0.1 and vienna-1.8.4. Previously, the input script had been reading and writing files in the Shared_Folder, which was connected to the Windows desktop. On a whim, thumbnails in Nautilus were disabled, all needed files were copied to the Ubuntu desktop, and the following script was run.
check_primer_barcode_dimers.py -p AATGATACGGCGACCACCGAGATCTACACTATGGTAATTGTTCCGCAAGTNGCVACBGTNGG -P ACCATCGTCAGRTARGGRAAVA -b barcodes.txt -e DNA_parameters/dna_DM.par -o out/
Results: For some reason, this worked. Perhaps the delay writing to the Shared_Folder caused a problem or perhaps the NTFS formatting of the host desktop caused permission problems for the script. But now, it works!
The script was stopped after 25 minutes. In that time, 700+ files were created in the output folder, most with names like
Line15_GAATACCAAGTC_primer1V86_primer1V86.px
and each around 5.8 kB in size. In the 25 minutes it was allowed to run, the script had progressed from Line0 to Line20. The input file, barcodes.txt, contains 1056 lines of barcodes. Back of the napkin, 1000 lines at 25 mins per 20 lines is about 20 hours. Processor usage was about 30% on each of 4 cores, so additional parallelization is possible.
3 hours
Aug 26, 2013
moving files to the cluster
August 26 for 2 hours
Objective: move files to Juniata's bioinformatics cluster and extract them.
Methods:
Email Chris Walls for an account on the cluster and cc Dr. Lamendella.
Download and install Cyberduck
Compress all the files you want to upload into a .zip archive (see the zip example after these steps). You can skip this step if the files are already compressed, like in a .gz or .tgz archive, or if you are uploading a small number of files.
Upload files with Cyberduck. Our server address is 10.39.6.10
Connect to our server over ssh (use PuTTY on Windows) and log in.
Extract the files.
For .zip files
unzip yourfile.zip
For .tgz files
tar xvzf file.tgz
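The zip archive mentioned in the compression step can be made like this (a sketch; the archive and folder names are placeholders):
zip -r yourfiles.zip YourFolderName/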
Results: files are on the cluster!
pprospector-1.0.1 does not work
August 26 for 2.5 hours
Objective: Help Abby install primer prospector to test possible primers.
Intro: the program 'pprospector-1.0.1' is bundled with Qiime from 1.5.0 and newer. The program 'RNAfold' is a co-dependency and is included with Vienna RNA. Vienna RNA is included with Qiime 1.7.0 (maybe 1.6.0 as well), so the stock Qiime 1.7.0 VM should be fully capable of running primer prospector.
Methods: Ran check_primer_barcode_dimers.py.
Results: file writing error
Apr 3, 2013
truncation scripts
Make a list of the files you want to process and call it all.txt. The list should alternate matching .fna and .qual file names, since the loop below reads them in pairs.
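One way to build all.txt so the .fna and .qual names come out in matching pairs (a sketch, assuming each sample has both files with the same base name):
ls *.fna *.qual | sort > all.txt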
Then run the following in bash:
while read fna;
do read qual;
echo "Running $fna";
time truncate_fasta_qual_files.py -f $fna -q $qual -b 220 -o filtered/;
echo "";
done < all.txt
I hope this works.
Apr 1, 2013
quality scores script
In a terminal, cd to the folder with all the converted .qual files.
Get a list of all .qual files in that folder by running:
ls *.qual > quallist.txt
Analyse the quality of every .qual file in that list by running:
while read line;
do time quality_scores_plot.py -q $line -o quality_histograms/$line/;
done < quallist.txt
Mar 27, 2013
bash Scripts for Lab
Goal: make a bash script that runs convert_fastaqual_fastq.py on all the files in a folder.
Methods: First, we need a list of all the files we want to run a script on.
Open a terminal, and use
cd
to move into the folder you want.
Then run:
ls > list.txt
This takes the results of
ls
and stores it all in list.txt.
Open list.txt in a text editor. If you don't want to run the qiime script on certain files, remove them from list.txt. Make sure there is one file name on each line and no extra or blank lines anywhere in the file.
Save it! Now copy and paste the following into that terminal you still have open:
while read line;
do time convert_fastaqual_fastq.py -f $line -o Output_Fastq -c fastq_to_fastaqual;
done < list.txt
This runs the script
convert_fastaqual_fastq.py
on each file in list.txt with all the given flags.
Here is the same code generalized to run other scripts:
while read line;
do time <copy and paste your script here, using $line as the input>;
done < list.txt
Feb 25, 2013
Qiime install disk
Goal: make a quick way to install Qiime.
Methods: the following files were burned to a single DVD.
- Qiime 1.6.0 .vdi.gz (zipped Qiime image for the VM)
- VirtualBox (VM software for both Mac OSX and Windows)
- 7-zip (archiving/compression software for Windows)
- A README.txt describing the files.
The two VirtualBox files are installers for a Virtual Machine (VM) that will allow us to easily run Qiime. There is an installer for Windows and another for Mac OSX. (Using Linux? Talk to me.)
QIIME-1.6.0-amd64.vdi.gz is a compressed archive we will run using the VM. It contains all of the Qiime software libraries installed on the operating system Ubuntu 12.04 LTS. We will need to extract this archive and save to our computer before it can be used. It will take up to 30 GB of hard drive space when fully extracted.
7z920-x64.msi is a Windows installer for 7-zip. This program will allow us to extract the very large archive mentioned above. (It will also let us compress and extract other archives, which can save a lot of space.)
This was quickly created by Colin Brislawn on 2/24/2013 for the Lamendella Lab at Juniata College.
Result: The DVD was used by the class to install the Qiime VM. Perhaps a compression method that has faster decompression (LZMA2? LZO?) could be used next time. Additional copies of the DVD would also make the process much faster.
Feb 17, 2013
Normalizing reads
Can Qiime analyze paired end reads?
Some parts of the workflow can handle it, but we will analyze from a single end, because that workflow has been established and used before.
What are the sequences of the 515F and 806R primers?
In our sequences, will we see these primers or their reverse complements?
In 16S, are the regions of high variability closer to the 515F or the 806R primer?
Method to normalize reads:
Our data has already been quality filtered. Because we do not have to worry about quality, we will use the following process, which relies on existing qiime scripts (and Erin's expert knowledge).
Use convert_fastaqual_fastq.py to get .fna and .qual files from the .fastq we have.
convert_fastaqual_fastq.py -f file_you_want_converted.fastq -o output_directory -c fastq_to_fastaqual
To run the files in bulk, I added all the file names to a file called test.txt and ran the following bash script overnight:
while read line
do time convert_fastaqual_fastq.py -f $line -o R1_fasta -c fastq_to_fastaqual
done < list.txt
I tried running this from an internal hard drive, but it was not faster at all.
I can run truncate_fasta_qual_files.py on the resulting .fna and .qual files, using the following script.
while read line;
do echo -e "\n\n running $line";
time truncate_fasta_qual_files.py -f $line.fna -q $line.qual -b 150 -o fasta_filtered/;
done < files.txt
This runs the truncation script on every file listed in
files.txt
, which looks like this:
ALXM_S15_L001_R1_001
BCWL_S16_L001_R1_001
CaRM_S17_L001_R1_001
CroRM_S18_L001_R1_001
...
All files are now the right length for the next stage in the pipeline!
Feb 13, 2013
A Quick and Dirty search for Bottlenecks
2 hours of work over 3 hours
Objective: find and remove bottlenecks from a system running qiime
Method: check resource use in the running system. Bottlenecks occur when one resource is the 'limiting reagent' in the speed of a program.
- Understand the system
In this case the top part of the running stack was basically:
- Qiime script written in python
- running in the command line in Ubuntu 12.04 64-bit
- in an Oracle VM
- on a server running Windows Server 2008 R2E 64-bit
- Look where you can
We can't easily look at everything (CPU cycles, Python). What we can't see, we can't optimize. We are only going to worry about what we can see and change. From the list above, I chose to look at resource use in Ubuntu and Windows because both OSs have good, built-in resource monitors.
System Monitor in Ubuntu
Ubuntu is towards the top of the stack, so I'll look at it first.
- Click on Dash Home at the top of the launcher on the left side of the screen.
- Search for 'monitor' and open it. (You can also right-click and 'lock to launcher'.)
- It shows use of CPU, Memory (and swap), and network. You can also see number of CPU cores and RAM and all running processes.
Task Manager in Windows
This offers a quick comparison of what's going on in Windows and Ubuntu.
- Right click on the taskbar. Click 'Start Task Manager'.
- This shows CPU, Memory, network, and processes. You can also see number of CPU cores and RAM and all running applications, processes, and services.
Resource Monitor in Windows
This is like Task Manager, but a lot more detailed.
- Click start and search 'resource'
- CPU use is divided into processes and services, and threads of each.
- RAM is divided by process.
- Disk reads and writes are shown by process. If Disk Access is the bottleneck, this is the quickest way to find it.
- Identify resources that are usually fully used...
... and then see if you can increase that resource.
Actions Taken: I closed the script (Ctrl+C in the terminal) and shut down the VM. In the VM's settings, I gave the VM a total of 15 cores and 16 GB of RAM. This improved speed tremendously (~1 order of magnitude). In retrospect, I should have given it more RAM, as more was available and the 16 GB was quickly used up. However, the CPUs were also maxed, so RAM may not have been the primary bottleneck.
Feb 7, 2013
Cleaning
I'm scheduled to clean the lab this week and I have never done this before. I dropped by before dinner and Erin showed me the basics.
I tidied up the lab, loaded one box of pipette tips, and readied some prepared pipette tips for the autoclave.
With the help of another Aaron, I autoclaved several boxes of tips and returned them to the lab.
Feb 2, 2013
Text Parsing
2.5 hours of work
Objective: Parse sequence files for Steven so they are in the correct format for further analysis.
Method: Get sequence files from Steven. I chose the Qiime VM (1.6.0 on Ubuntu) as a platform for editing. Look into using python 2.7.3 to read and write files. Also looked into text parsing and string matching using python. I installed python and its IDE on the VM.
Decide that this is taking way, way too long, and a different approach is needed. I have experience using simple regular expressions, so I tried using them.
On Windows 7, download and run Notepad++ from a .zip archive.
Using the built-in find-and-replace function, use following regexp:
Find:
/.+
Note the leading space: the pattern starts with a space before the forward slash.
Replace: Nothing! (This deletes the selected text.)
This selects the first 'space' and 'forward slash' and 'everything' after that on a single line.
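The same edit could also be scripted, which helps when there are many files (a sketch, not what was actually run; file names are placeholders):
# delete from the first ' /' to the end of each line
sed -E 's| /.+$||' input.fna > output.fna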
I ran this on 4 files, each about 200 MB of text. About 5 minutes total. Each file is about 160 MB after it is edited.
Results: Returned parsed files to Steven.
Moral: Don't reinvent the wheel. Instead of attempting to learn new things, I should have tried to solve the problem with tools I am already skilled with and comfortable using. I did not need a fancy script, just a good text editor. Also, regular expressions always come in handy.
Jan 24, 2013
Test post please ignore
2 hours over 3 days
Objective: create a blog for use as a lab notebook.
Method: Check with Dr. Lamendella that publicly publishing lab notebooks is appropriate. (It is!)
Explore blogging services. I chose the service 'Blogger' because it integrates with other Google services I commonly use.
Register a descriptive blog name. Create a test post.
Results: The blog brislawnresearch.blogspot.com will be used to record information about research with the Lamendella Lab. A binder will also be used to organize notes taken on paper.
Afterthoughts: There are benefits and drawbacks of both electronic and paper records. By using this blog, I hope to gain the mobile availability and immediacy of electronic media while preserving the consistency and ease of sharing that paper provides. Or not; I might experience the inflexibility of electronic documents without benefiting from the easy sharing that paper provides.