Feb 2, 2013

Text Parsing

2.5 hours of work
Objective: Parse sequence files for Steven so they are in the correct format for further analysis.
Method: Got the sequence files from Steven. I chose the QIIME VM (1.6.0 on Ubuntu) as a platform for editing. Looked into using Python 2.7.3 to read and write files, and into text parsing and string matching with Python. I installed Python and its IDE on the VM.
Decided that this was taking way, way too long and that a different approach was needed. I have experience using simple regular expressions, so I tried using them.
On Windows 7, download and run Notepad++ from a .zip archive.
Using the built-in find-and-replace function in regular expression mode, use the following regexp:
Find: " /.+" (without the quotes). Note the leading space.
Replace: Nothing! (This deletes the selected text)
This matches the first space followed by a forward slash, plus everything after them up to the end of that line.
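For comparison, the same pattern can be tried in Python's re module. This is only a sketch, not what I actually ran: the sample line is made up for illustration and is not taken from Steven's files, and strip_after_space_slash is a hypothetical helper name.

```python
import re

# Same pattern as the Notepad++ search: a space, a forward slash,
# then everything up to the end of the line ('.' does not match a newline).
PATTERN = re.compile(r" /.+")

def strip_after_space_slash(line):
    """Remove the first ' /' and everything after it on the line."""
    return PATTERN.sub("", line)

# Hypothetical sample line, invented for illustration.
sample = ">seq_001 /orig_id=ABC123 length=250"
print(strip_after_space_slash(sample))  # prints: >seq_001
```

Lines that do not contain a space followed by a slash pass through unchanged.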
I ran this on 4 files, each about 200 MB of text. It took about 5 minutes total. Each file is about 160 MB after it is edited.
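If the same edit were scripted, files of this size would not need to be loaded into memory all at once. A minimal sketch, assuming Python 3 and plain-text input; clean_file and the paths passed to it are hypothetical placeholders, not part of the work described here.

```python
import re

# A space, a forward slash, and the rest of the line.
PATTERN = re.compile(r" /.+")

def clean_file(src_path, dst_path):
    """Copy src_path to dst_path, deleting ' /...' from each line."""
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        for line in src:  # reads one line at a time
            dst.write(PATTERN.sub("", line))
```

Calling clean_file once per input file would write an edited copy and leave the original untouched.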
Results: Returned parsed files to Steven.
Moral: Don't reinvent the wheel. Instead of attempting to learn new things, I should have tried to solve the problem with tools I am already skilled with and comfortable using. I did not need a fancy script, just a good text editor. Also, regular expressions always come in handy.
