Annotation tool to extract endangered animals from text resources
Anteater Documentation

Basic idea

The basic idea of Anteater is the following:

  • Extract names of people (and if possible institutions), species names and places
  • Filter extracted information to retrieve applicants, species and location of applicants/locations of research
  • Associate people with species and locations according to "who researches which species where."

Step 1: Extract people, species and places

Each type of information (people, species and places) is extracted differently. Names of people are extracted using the Stanford NLP library (http://nlp.stanford.edu/). Its named entity recognizer (NER) is used to generate candidates for applicant names. The NER returns the text marked up with persons, locations and organizations.
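As an illustration of how candidates could be collected from the marked-up text, here is a minimal sketch. It assumes the NER output uses inline-XML style tags such as <PERSON>...</PERSON>; the function name collect_candidates and the sample sentence are hypothetical and not taken from Anteater itself.

```python
import re
from collections import defaultdict

# Assumption: entities are marked with inline tags like <PERSON>Jane Doe</PERSON>.
TAG_PATTERN = re.compile(r"<(PERSON|LOCATION|ORGANIZATION)>(.*?)</\1>", re.DOTALL)


def collect_candidates(marked_up_text):
    """Group tagged spans by entity type, dropping duplicates but keeping order."""
    candidates = defaultdict(list)
    for match in TAG_PATTERN.finditer(marked_up_text):
        entity_type = match.group(1)
        # Collapse internal whitespace so line breaks inside a name do not matter.
        surface = " ".join(match.group(2).split())
        if surface and surface not in candidates[entity_type]:
            candidates[entity_type].append(surface)
    return dict(candidates)


sample = (
    "<PERSON>Jane Doe</PERSON> of the <ORGANIZATION>Example Zoo</ORGANIZATION> "
    "in <LOCATION>San Diego</LOCATION> requests a permit. "
    "<PERSON>Jane Doe</PERSON> will lead the study."
)
print(collect_candidates(sample))
```

All three entity types are kept at this stage because the filter in step 2 takes persons, locations and organizations as input.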

To extract species, the text is sent to the Global Names Recognition and Discovery (GNRD) web service (http://gnrd.globalnames.org/). GNRD returns XML that is parsed to generate candidate species.
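The parsing step might look roughly like the sketch below. The XML layout and the element name scientificName are assumptions made for illustration only; the actual GNRD response schema is not described here and may differ, so the element name is left as a parameter.

```python
import xml.etree.ElementTree as ET

# Hypothetical response layout, invented for this example.
SAMPLE_RESPONSE = """\
<result>
  <names>
    <name><scientificName>Panthera tigris</scientificName></name>
    <name><scientificName>Loxodonta africana</scientificName></name>
    <name><scientificName>Panthera tigris</scientificName></name>
  </names>
</result>
"""


def parse_species_candidates(xml_text, name_tag="scientificName"):
    """Return the distinct names found in elements called name_tag, in document order."""
    root = ET.fromstring(xml_text)
    seen = []
    for element in root.iter(name_tag):
        name = (element.text or "").strip()
        if name and name not in seen:
            seen.append(name)
    return seen


print(parse_species_candidates(SAMPLE_RESPONSE))
```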

The extraction of places works similarly to the extraction of species. The only difference is that the text is sent to Yahoo! Placemaker (http://developer.yahoo.com/geo/placemaker/).

Step 2: Filter candidates

The results created in step 1 vary in quality. Therefore, a different filter mechanism is used for each kind of information. The GNRD service has very high precision; almost all names it finds are actually species names. To catch the few names that are found incorrectly, species candidates are checked against the places found in step 1 and filtered out if Placemaker found them as well.
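The species filter amounts to removing every candidate that also appears among the found places. A minimal sketch follows; the function name filter_species is hypothetical, and the case-insensitive comparison is an assumption of this example rather than documented behavior.

```python
def filter_species(species_candidates, place_names):
    """Keep only species candidates that were not also reported as places."""
    # Assumption: names are compared case-insensitively.
    places = {place.casefold() for place in place_names}
    return [name for name in species_candidates if name.casefold() not in places]


species = ["Panthera tigris", "Victoria", "Chelonia mydas"]
places = ["Victoria", "Kenya"]
print(filter_species(species, places))
```

Here "Victoria" is dropped because it was found both as a species candidate and as a place.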

To filter the people found by the Stanford NLP library, a machine-learning algorithm is used. Its input is all persons, locations and organizations found by the Stanford NER, because the NER is not very accurate in assigning entity types (e.g. an organization might be tagged as a location). Each instance in the model (each candidate for being an applicant) is described by 20 features, such as the length of the name (person name, institution name, etc.) and whether it is the subject of the sentence (see appendix A for a list of all features). The model was trained on 186 instances using a “LADTree” classifier. With a training set of 2/3 of all instances and a test set of the remaining 1/3, 90.5% of the test instances were classified correctly and 9.5% incorrectly. The model used in Anteater was trained with the complete training set using cross-validation.
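The 2/3 to 1/3 evaluation described above can be sketched in a few lines. This is not the LADTree classifier: the one-line rule below is only a stand-in so the example runs, and the helper names (holdout_split, accuracy), the toy features and the toy labels are all hypothetical.

```python
import random


def holdout_split(instances, train_fraction=2 / 3, seed=0):
    """Shuffle the instances and split them into a training and a test part."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(classify, labeled_instances):
    """Fraction of (features, label) pairs for which classify(features) == label."""
    correct = sum(1 for features, label in labeled_instances if classify(features) == label)
    return correct / len(labeled_instances)


# Toy data: each instance is (features, is_applicant). Real instances have 20 features.
toy_instances = [
    ({"ner_tag": "PERSON" if i % 3 else "LOCATION", "name_length": 5 + i % 20}, i % 3 != 0)
    for i in range(186)
]


def stand_in_rule(features):
    # Stand-in for a trained classifier: call every PERSON an applicant.
    return features["ner_tag"] == "PERSON"


train, test = holdout_split(toy_instances)
print(len(train), len(test), accuracy(stand_in_rule, test))
```

The accuracy is measured only on the held-out part, which corresponds to the percentage of correctly classified instances reported above.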

To filter location results, a machine-learning algorithm is used as well, in this case an “LMT” tree classifier. The classifier was trained on 266 instances, each with 20 features (see appendix A). It distinguishes four classes: not relevant, location of research, location of applicant and institution of applicants. In training, 86.5% of the instances were classified correctly and 13.5% incorrectly.
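Once each location candidate carries one of the four class labels, the filter step reduces to dropping the "not relevant" candidates and grouping the rest. The sketch below shows one way to represent this; the names LocationClass and group_locations are hypothetical, and the classification itself is assumed to have happened already.

```python
from collections import defaultdict
from enum import Enum


class LocationClass(Enum):
    NOT_RELEVANT = "not relevant"
    RESEARCH_LOCATION = "location of research"
    APPLICANT_LOCATION = "location of applicant"
    APPLICANT_INSTITUTION = "institution of applicants"


def group_locations(classified):
    """Group (name, LocationClass) pairs by class, discarding irrelevant candidates."""
    grouped = defaultdict(list)
    for name, label in classified:
        if label is not LocationClass.NOT_RELEVANT:
            grouped[label].append(name)
    return dict(grouped)


classified = [
    ("Costa Rica", LocationClass.RESEARCH_LOCATION),
    ("San Diego", LocationClass.APPLICANT_LOCATION),
    ("Example Zoo", LocationClass.APPLICANT_INSTITUTION),
    ("Page", LocationClass.NOT_RELEVANT),
]
print(group_locations(classified))
```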

Step 3: Creating "research events"

The results of step 2 are used to create so-called "research events." A research event contains the applicants of a permit/application, their locations and, if applicable and available, their institutions, the species that the applicants applied to conduct research on, and the locations where the research would be conducted.

To create these events, several rules were defined to collect and sort the information found in step 2.
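A research event can be pictured as a small record with one field per kind of information. The sketch below is an assumption about its shape, not Anteater's actual data model, and the single rule it applies (put everything from one document into one event, sorted by class label) is a placeholder for the real rules, which are not listed here.

```python
from dataclasses import dataclass, field


@dataclass
class ResearchEvent:
    applicants: list = field(default_factory=list)
    applicant_locations: list = field(default_factory=list)
    institutions: list = field(default_factory=list)
    species: list = field(default_factory=list)
    research_locations: list = field(default_factory=list)


def build_event(applicants, species, locations):
    """Build one event from filtered applicants, species and (name, label) locations."""
    event = ResearchEvent(applicants=list(applicants), species=list(species))
    # Route each location to a field according to its class label from step 2.
    targets = {
        "location of research": event.research_locations,
        "location of applicant": event.applicant_locations,
        "institution of applicants": event.institutions,
    }
    for name, label in locations:
        target = targets.get(label)  # "not relevant" and unknown labels are skipped
        if target is not None and name not in target:
            target.append(name)
    return event


event = build_event(
    ["Jane Doe"],
    ["Chelonia mydas"],
    [
        ("Costa Rica", "location of research"),
        ("San Diego", "location of applicant"),
        ("Example Zoo", "institution of applicants"),
        ("Page", "not relevant"),
    ],
)
print(event)
```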

Output

Anteater creates the following files (all in XML format):

  • Analysis files
    These files contain the results from step 1 and the text that is analyzed.
  • Machine Learning files
    These files contain the data sets that are used in the machine learning component in step 2 (for applicants and locations).
  • Pre-result files (can be downloaded)
    These files contain the analyzed texts (summary and supplementary information) marked up with found applicants, locations and species.
  • Event files (can be downloaded)
    These files contain the generated research events and are the final results.

Anteaters are found in Central and South America, where they prefer tropical forests and grasslands. There are four different species which vary greatly in size. (http://animals.nationalgeographic.com/animals/mammals/giant-anteater/)