The basic idea of Anteater is the following:
Each type of information (people, species and place) is extracted differently. Names of people are extracted using the Stanford NLP library (http://nlp.stanford.edu/). This library includes a named entity recognizer (NER) that is used to generate candidates for applicant names. The NER returns the text marked up with persons, locations and organizations.
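As an illustration of how candidates could be collected from such marked-up text, here is a minimal Python sketch. It assumes the NER output uses inline XML tags (<PERSON>, <LOCATION>, <ORGANIZATION>); the function name extract_candidates and the sample sentence are made up for this example and are not taken from Anteater itself.

```python
import re
from collections import defaultdict

# Assumption: entities are marked inline as <PERSON>...</PERSON>,
# <LOCATION>...</LOCATION> or <ORGANIZATION>...</ORGANIZATION>.
TAG_PATTERN = re.compile(r"<(PERSON|LOCATION|ORGANIZATION)>(.*?)</\1>", re.DOTALL)


def extract_candidates(tagged_text):
    """Group tagged entity strings by entity type, in order, without duplicates."""
    candidates = defaultdict(list)
    for tag, value in TAG_PATTERN.findall(tagged_text):
        name = " ".join(value.split())  # collapse line breaks and repeated spaces
        if name and name not in candidates[tag]:
            candidates[tag].append(name)
    return dict(candidates)


# Invented sample input, for illustration only.
sample = (
    "<PERSON>Jane Doe</PERSON> of the <ORGANIZATION>University of Example"
    "</ORGANIZATION> applied to work in <LOCATION>Costa Rica</LOCATION>."
)
print(extract_candidates(sample))
```

All three entity types are kept at this stage, because the later filter step also looks at entities that the NER may have tagged with the wrong type.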
To extract species, the text is sent to the Global Names Recognition and Discovery (GNRD) web service (http://gnrd.globalnames.org/). GNRD returns XML, which is parsed to generate candidate species.
The extraction of places works similarly to the extraction of species. The only difference is that the text is sent to Yahoo! Placemaker (http://developer.yahoo.com/geo/placemaker/).
The results created in step 1 differ in quality. Therefore, a different filter mechanism is used for each kind of information. The GNRD service has very high precision; almost all names it reports are actually species names. To catch the few false positives, the species candidates are checked against the places found and filtered out if Placemaker found them as well.
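The species filter described above amounts to removing every candidate that also appears among the place names. A minimal Python sketch follows; the function name filter_species, the case-insensitive comparison and the sample values are assumptions made for illustration, since the text does not say how the two result sets are matched.

```python
def filter_species(species_candidates, place_names):
    """Drop species candidates that the place extractor also reported.

    Assumption: names are compared case-insensitively as whole strings.
    """
    places = {name.casefold() for name in place_names}
    return [s for s in species_candidates if s.casefold() not in places]


# Invented sample values, for illustration only.
species = ["Atta cephalotes", "Costa Rica", "Eciton burchellii"]
places = ["Costa Rica", "San José"]
print(filter_species(species, places))
```

Here "Costa Rica" is dropped from the species list because it also occurs among the places, while the two remaining candidates pass through unchanged.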
To filter people found by the Stanford NLP library, a machine-learning algorithm is used. Its input consists of all persons, locations and organizations found by the Stanford NER, since the NER is not very accurate in classifying entities (e.g. an organization might be tagged as a location). Each instance in the model (each candidate for being an applicant) is described by 20 features, such as the length of the name (person name, institution name, etc.) and whether it is the subject of the sentence (see appendix A for a list of all features). The model was trained on 186 instances using a “LADTree” classifier. With a training set of 2/3 of all instances and a test set of the remaining 1/3, 90.5% of the instances were classified correctly and 9.5% incorrectly. The model used in Anteater was trained with the complete training set using cross-validation.
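The 2/3 to 1/3 evaluation can be sketched in a few lines of Python. This is only an illustration of the hold-out procedure, not the tooling actually used to train the LADTree model; the function names holdout_split and percent_correct, the shuffling and the fixed seed are assumptions.

```python
import random


def holdout_split(instances, train_fraction=2 / 3, seed=0):
    """Shuffle the instances and split them into a training and a test set."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


def percent_correct(predicted, actual):
    """Percentage of test instances whose predicted class equals the actual class."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot score an empty test set")
    hits = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * hits / len(actual)


# With 186 instances this yields 124 training and 62 test instances.
train, test = holdout_split(range(186))
print(len(train), len(test))
```

The classifier would be fitted on the training part only, and percent_correct would then be applied to its predictions for the test part.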
To filter location results, a machine-learning algorithm is used as well, in this case an “LMT” tree classifier. The classifier was trained on 266 instances, each described by 20 features (see appendix A). It distinguishes four classes: not relevant, location of research, location of applicant and institution of applicants. Training results are 86.5% correctly classified instances and 13.5% incorrectly classified instances.
The results of step 2 are used to create so-called "research events." A research event contains the applicants of a permit/application, their locations and, if applicable and available, their institutions, the species that the applicants applied to conduct research on, and the locations where the research would be conducted.
To create these events, several rules were defined to collect and sort the information found in step 2.
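One way to picture a research event is as a small record type. The Python sketch below mirrors the fields listed above; the class and field names (Applicant, ResearchEvent, research_locations and so on) and the sample values are hypothetical, and the rules that populate an event are not shown because they are not specified here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Applicant:
    name: str
    location: Optional[str] = None     # location of the applicant, if found
    institution: Optional[str] = None  # only if applicable and available


@dataclass
class ResearchEvent:
    applicants: List[Applicant] = field(default_factory=list)
    species: List[str] = field(default_factory=list)             # species to be studied
    research_locations: List[str] = field(default_factory=list)  # where the research takes place


# Invented sample values, for illustration only.
event = ResearchEvent(
    applicants=[Applicant(name="Jane Doe", institution="University of Example")],
    species=["Atta cephalotes"],
    research_locations=["Costa Rica"],
)
print(event)
```

Keeping the location and institution optional reflects that this information is not always present in a permit or application.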
Anteater creates the following files (all in XML format):