Friday, March 20, 2009

Common Name Search Begins

I am working on a portion of code that takes the scientific name of each organism in a given tree and searches for its corresponding common name. I am doing the search using the database at ubio.org. I can currently post a search term to the form and retrieve the resulting HTML page. I can also parse that page to determine whether it is the page for a specific organism, or whether the site has found multiple possible matches and is asking for more information.

The current issue is that the HTML I get back from the initial search in my program is not the same as the HTML I see when I run the same search in a browser and view the page source. That's definitely a sizeable issue. The URL is the same, but the browser gets a page asking for more information, while my program gets an "Advanced Search" page that I can't even navigate to if I try in my browser.
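One possible explanation, and it is only a guess, is that the server responds differently when a request lacks the headers a browser sends or leaves out fields the form normally submits. The sketch below builds a POST request that looks more like a browser's, using Python's standard urllib. The URL, the form field name `term`, and the header values are all placeholders for illustration, not details taken from ubio.org.

```python
import urllib.parse
import urllib.request
from typing import Dict, Optional


def build_search_request(url: str,
                         form_fields: Dict[str, str],
                         referer: Optional[str] = None) -> urllib.request.Request:
    """Build a form POST that carries browser-like headers.

    `url` and the keys of `form_fields` are hypothetical here; the real
    values would have to be read from the search form's HTML source.
    """
    # Encode the fields the same way a browser encodes a submitted form.
    body = urllib.parse.urlencode(form_fields).encode("ascii")
    headers = {
        # Assumed header values; some sites serve different pages without them.
        "User-Agent": "Mozilla/5.0 (compatible; common-name-search)",
        "Accept": "text/html",
        "Content-Type": "application/x-www-form-urlencoded",
    }
    if referer is not None:
        headers["Referer"] = referer
    # Supplying `data` makes urllib send the request as a POST.
    return urllib.request.Request(url, data=body, headers=headers)
```

The request would then be sent with `urllib.request.urlopen(request)`, and the returned HTML compared against the browser's page source. If the pages still differ, cookies or hidden form fields would be the next things to check.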

I at least have my algorithm written out for how I intend to parse the HTML in order to find the common name of the organism (once I figure out how to get to the correct HTML).

Enter the search term into ubio.org and get the resulting page.

After the text string "Scientific Match", search for the "a href ='" string.
   Save the text that follows, up to the next "'", as a link string.
   Save the substring of the text after the next > and before the following <.
   Compare that substring to the search term.
   If it's a match:
      Follow the saved link.
      On the resulting page:
         Find the text "name = 'Common name'".
         After that point, find "namebankID".
         Save the substring of the text after the next > and before the following <.
         Return that substring as the common name.
   Else:
      Repeat by searching for the next "a href = '" string.
      Stop when you've reached ".
      Return either the search term or a null string as the common name.
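The outline above can be sketched in Python roughly as follows. This is a minimal sketch under several assumptions: the regular expressions tolerate varying whitespace around `=` because the exact spacing in the real pages is not known, the comparison ignores case, the scan simply stops at the end of the page, and `fetch` is a hypothetical caller-supplied function that downloads the HTML for a link. The function names are illustrative as well.

```python
import re
from typing import Callable, Optional

# "a href='...'" followed by the link text between the next > and <.
LINK_RE = re.compile(r"a\s+href\s*=\s*'([^']*)'[^>]*>([^<]*)<")
# The marker that precedes the common name on an organism's page.
COMMON_NAME_RE = re.compile(r"name\s*=\s*'Common name'")


def find_matching_link(search_html: str, term: str) -> Optional[str]:
    """Return the link whose text equals the search term, or None."""
    start = search_html.find("Scientific Match")
    if start == -1:
        return None
    wanted = term.strip().lower()
    # Walk through every link after the marker until one matches.
    for match in LINK_RE.finditer(search_html, start):
        link, label = match.group(1), match.group(2)
        if label.strip().lower() == wanted:
            return link
    return None  # reached the end of the page without a match


def extract_common_name(detail_html: str) -> Optional[str]:
    """Return the text between > and < that follows "namebankID"."""
    marker = COMMON_NAME_RE.search(detail_html)
    if marker is None:
        return None
    pos = detail_html.find("namebankID", marker.end())
    if pos == -1:
        return None
    gt = detail_html.find(">", pos)
    lt = detail_html.find("<", gt + 1) if gt != -1 else -1
    if lt == -1:
        return None
    return detail_html[gt + 1:lt].strip()


def common_name(term: str,
                search_html: str,
                fetch: Callable[[str], str],
                fallback: Optional[str] = None) -> Optional[str]:
    """Follow the matching link and return the common name, else `fallback`."""
    link = find_matching_link(search_html, term)
    if link is None:
        return fallback
    name = extract_common_name(fetch(link))
    return name if name else fallback
```

Passing `fallback=term` returns the search term when nothing is found, and leaving it as `None` gives the null result, which covers both options in the last step of the outline.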

  

