Saturday, April 18, 2009

Pre-Parsing Label Names

Status Update Summary:
I have finished the part of my project that pre-parses Newick tree files so that the leaves display the organisms' scientific names rather than their index numbers.

Explanation:
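The files I am working with look like the following: a NEXUS TREES block whose TRANSLATE table maps each index number to a name, followed by a Newick tree string whose structure is defined by commas and parentheses: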

#NEXUS
BEGIN TREES;

TRANSLATE
1 Alligator_mississippiensis,
2 Chinchilla_brevicaudata,
3 Felis_silvestris_catus,
4 Balaenoptera_borealis,
5 Oryctolagus_cuniculus,
6 Balaenoptera_physalus,
7 Mesocricetus_auratus,
8 Meleagris_gallopavo,
9 Lepisosteus_spatula,
10 Camelus_dromedarius,
11 Proechimys_guairae,
12 Anas_platyrhynchos,
13 Platichthys_flesus,
14 Goosefish_lophinus,
15 Hydrolagus_colliei,
16 Canis_familiaris,
17 Physeter_catodon,
18 Acomys_cahirinus,
19 Hystrix_cristata,
20 Myocastor_coypus,
21 Struthio_camelus,
22 Myxine_glutinosa,
23 Saimiri_sciureus,
24 Cavia_porcellus,
25 Elephas_maximus,
26 Gadus_callarias,
27 Cyprinus_carpio,
28 Thunnus_thynnus,
29 Equus_caballus,
30 Crotalus_atrox,
31 Batrachoididae,
32 Myoxocephalus,
33 Gallus_gallus,
34 Capra_hircus,
35 Oncorhynchus,
36 Mus_musculus,
37 Homo_sapiens,
38 Anser_anser,
39 Ovis_aires,
40 Sus_scrofa,
41 Bos_taurus,
42 Mammalia,
43 Macaca
;
TREE tree_0 =  [&R] (((((((11,20),24),19),2,7,18,(36,23),37,43,5,(16,17,6,40),29,25,41,3,4,(34,39)),10),(33,8,21),30,1,15,(38,12),9,((31,26),32,14,28,13,27,35)),22);
ENDBLOCK;


Currently, PhyloWidget only looks at the TREE line near the end of the file, where the structure of the tree is defined. As a result, the leaves are labeled with numbers rather than scientific names. My pre-parser replaces each number in the TREE line with the corresponding scientific name from the TRANSLATE table, so that the true names become the labels stored in the leaf nodes. For the tree listed above, the following image is now the new output:


Since PhyloWidget supports loading trees from files as well as from manual input, I've inserted the pre-parsing phase into both processes. In both cases it expects the file or the user input to be formatted like my example above.
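
To make the replacement step concrete, here is a minimal sketch in Python. It is only an illustration of the idea, not PhyloWidget's actual code, and the function names (parse_translate, extract_tree, substitute_labels, preparse) are made up for this example. It assumes the input has a single TRANSLATE table and a single TREE statement laid out like the example above, and that leaf labels are bare integers.

```python
import re


def parse_translate(nexus_text):
    """Build {index: name} from the TRANSLATE table (hypothetical helper)."""
    # Everything between the TRANSLATE keyword and the next ';' is the table.
    match = re.search(r"\bTRANSLATE\b(.*?);", nexus_text,
                      re.IGNORECASE | re.DOTALL)
    if match is None:
        return {}
    table = {}
    for entry in match.group(1).split(","):
        parts = entry.split()
        if len(parts) == 2:  # expect "<index> <name>"
            table[parts[0]] = parts[1]
    return table


def extract_tree(nexus_text):
    """Return the Newick string of the first TREE statement, or None."""
    # Skip the tree name, the '=', and an optional bracketed comment like [&R].
    match = re.search(r"\bTREE\b[^=]*=\s*(?:\[[^\]]*\]\s*)?([^;]*;)",
                      nexus_text, re.IGNORECASE)
    return match.group(1) if match else None


def substitute_labels(newick, table):
    """Replace numeric leaf labels with their names from the table."""
    # A leaf label directly follows '(' or ',' and is followed by
    # ',', ')', ':' or ';'. Branch lengths follow ':' and are left alone.
    pattern = r"(?<=[(,])\s*(\d+)(?=\s*[,):;])"
    return re.sub(pattern,
                  lambda m: table.get(m.group(1), m.group(1)),
                  newick)


def preparse(nexus_text):
    """Return the tree string with index numbers swapped for names."""
    tree = extract_tree(nexus_text)
    if tree is None:
        return None
    return substitute_labels(tree, parse_translate(nexus_text))


example = """#NEXUS
BEGIN TREES;
TRANSLATE
1 Homo_sapiens,
2 Mus_musculus,
3 Gallus_gallus
;
TREE tree_0 =  [&R] ((1,2),3);
ENDBLOCK;
"""

print(preparse(example))
```

Matching whole numbers only in leaf positions means that index 1 is never substituted inside 10 or 11, and any index missing from the table is left as it was.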
