D.3.2 Selecting tools for users’ experience testing
Abstract – The main aim of this deliverable was to facilitate the selection of identification tools for users’ experience testing with schools, both by the KeyToNature partners involved in WP8 and by the schools themselves. To do this, we had to classify our identification tools by some fundamental features and aggregate them into a comprehensive and updated list. The classification was based on an advanced version of the data exchange agreement (D.4.4), while the first basis of the list was deliverable D.3.1. Produced in a very short time (3 months after the beginning of the project), the latter was a collection of metadata on the identification tools available within the KeyToNature Consortium. From month 4 to month 9, additional metadata were requested from all data providers, and the final result proved good enough to allow the publication on the project website (August 2008) of a search engine for browsing the more than 1200 identification tools gathered by KeyToNature partners to that date, classified on the basis of several fundamental features. This work was carried out in close collaboration with WP4, and the final result anticipates by 18 months an important part of the searchable database which was due in month 30 as a deliverable of WP4. WP4 will take care of the further implementation of the database until the end of the project. The present deliverable consists of two parts: 1) the present printed document, which provides a) a general introduction to the main features used to formally classify the identification tools and b) an explanation of the search fields; and 2) the search engine itself, which can be consulted from the entry page of www.keytonature.eu, and which we consider the most substantial part of the deliverable.
In 2008 almost all partners of KeyToNature became involved in the activities of Workpackage 8 (users’ experience). One of their main tasks at the beginning of the work is to design the approach they intend to follow for contacting schools in their respective countries. Such approaches may differ widely among partners, depending on several factors, such as the share of work in WP8, the characteristics of the national school systems, and above all the availability of identification tools which fit the requirements of partners in terms of language, educational levels and types of organisms. Therefore, it was essential to provide partners and schools with an overview of all of the identification tools which are available for testing. To do this, we had to classify our identification tools by some fundamental features and aggregate them into a comprehensive list. The classification was based on an advanced version of the data exchange agreement (D.4.4), while the first basis of the list was deliverable D.3.1. Produced in a very short time (3 months after the beginning of the project), the latter was a collection of metadata on the identification tools available within the KeyToNature Consortium. From month 4 to month 9, additional metadata were requested from all data providers, and the final result proved good enough to allow the publication on the project website (August 2008) of a search engine for browsing the more than 1200 identification tools gathered from KeyToNature partners to that date, classified on the basis of several fundamental features. The new search engine, which is relevant not only for WP8 but also for several other workpackages, has considerably enriched the content of the project website. The work was carried out in close collaboration with WP4, and the final result anticipates by 18 months an important part of the searchable database which was due in month 30 as a deliverable of WP4 (D.4.3).
WP4 will take care of the further implementation of the database until the end of the project. This deliverable consists of two parts: 1) the present printed document, which provides a) a general introduction to the main features used to formally classify the identification tools and b) an explanation of the search fields; and 2) the search engine itself, which can be consulted from the entry page of www.keytonature.eu, and which we consider the most substantial part of the deliverable.
THE MAIN STRUCTURAL FEATURES OF IDENTIFICATION TOOLS
This is a brief introduction to the main structural features of identification tools in general. It may help the reader to understand the features that were selected for classifying the identification tools produced by the KeyToNature partners and for making them searchable.
Levels of interactivity
A first classification of identification tools can be based on the different degrees of interaction between human users and the tools themselves. Humans may be involved in the identification process as:
· initiator, requesting an identification,
· operator, performing standardized routine operations,
· expert, adding expert knowledge or reasoning not embedded in the key itself,
· teacher, explaining identification concepts.
In principle, any form of involvement may be considered as an “interaction”, but it seems desirable to restrict this term to the cases where the interaction involves human knowledge and reasoning. In this sense one can define:
· Automatic identification as a process that can be performed by a machine (with humans only initiating processes and performing standardized operations). Examples are DNA sequencing, bar-coding (e. g., Cowan & al. 2006) or DNA microarray methods (e. g., Loy & al. 2002, Leinberger & al. 2005), many image processing methods like human iris or face recognition, leaf outlines (Agarwal & al. 2006), recognition of spores (e. g., Chesmore & al. 2003), shell fish larvae (Tiwari & Gallager 2003), hymenopteran wings (ABIS method, Steinhage & al. 2001), and spiders (Do & al. 1999), or the fatty-acid-based MIS Microbial Identification System (MIDI 2007). Further examples and a discussion of the opportunities and obstacles to automated identification may be found in Gaston & O'Neill (2004). Dreams for the future include handheld devices usable by the general public, performing automated molecular identifications (Janzen 2004).
· Interactive identification as a process where human knowledge and reasoning are supported by a knowledge base, whether or not a computer is involved in processes like sorting, lookup, and elimination of options. Note that, in contrast to this definition (which includes the majority of humans using a printed key), the term has been widely used to refer exclusively to computer-aided identification using multi-access or multi-entry keys (see later).
The distinction between automatic and interactive identification tools is not always unambiguous. Automatic identification methods often require manual preparation and processing of objects which, although in principle standardized, may involve some complex choices to be made. Conversely, computer-aided interactive identification may occasionally require only fairly standardized operations by humans, e. g., comparing an object with colors or shapes displayed on a computer screen. One may say that a truly interactive identification occurs only if it exploits a substantial amount of human knowledge and experience (e. g., of terminology, classification, or frequency of object occurrence). A typical case of such interactive identification is the informed choice of characters in a multi-access key. Identification may be semi-automatic in the sense that a computer-aided system may present a choice of potential results based on fully automated methods (e. g., image processing), optionally together with an estimate for the likelihood of correct identification, but then requires a human confirmation of the result.
All of the identification tools in the hands of KeyToNature are of the interactive type. Incidentally, it can be added that some partners of KeyToNature are presently exploring the possibility of linking a method of automatic identification (DNA-barcoding) with an interactive identification tool based on morphological characters.
For example, DNA barcoding in vascular plants is rarely capable of identifying a plant at species level: most often it reaches the level of genus, subgenus, or of a group of closely related species. The idea which could be tested is to use the barcode as a “filter” in an interactive identification tool based on non-molecular characters, which would be automatically invoked only for the “filtered” species, in order to permit identification at species level.
Phases of interactive identification
Most biological interactive identification processes, whether supported by a computer or not, may be divided into the following phases:
1) Orientation phase. Initially, the appropriate resources (e. g., identification tools, descriptions, images) and observation methods (hand-lens, microscope, chemical reagents, etc.) for the object at hand are selected. The choice of the right identification tool may depend on many factors beyond the recognition of a broad taxonomic group: e. g., expertise of identifier, geographic location, season (summer, winter). In publications addressing experts, the orientation phase may be considered as implicit, but publications for the general public or students will often explicitly discuss the choice of methods and often provide entry keys (e. g. to families or orders). A fundamental part of the identification process will almost always be based on common knowledge. However, the extent of what is considered “fundamental” (recognition of something as an insect versus as a member of the moth family Geometridae) is audience-dependent.
2) Identification phase, consisting of one or both of:
a) Key phase. Using some form of identification tool (e. g., a dichotomous or multi-access key), the object is identified until only one, or a few taxa remain. Typically, only a subset of the potential descriptive characters will be used during this phase.
Well-designed keys may come close to a binary search algorithm, which is very effective for a large number of species. In authored keys, e. g., dichotomous keys, the experience of generations of scientists is provided, leading to a selection of effective, convenient, and reliable characters. The phase may be skipped if previous experience already results in a sufficiently small set of taxa to warrant starting with a browsing phase.
b) Browsing phase (“scanning” in Stevenson & al. 2003). Short descriptions or illustrations of organisms are compared with the object to find the correct class name. The phase is skipped if the key phase ends with a single identification. The number of taxa after which a browsing phase is considered more profitable than continuing the key phase will depend on the difficulty of the key, and may differ between professionals and amateurs.
3) Confirmation (or verification) phase. The object is thoroughly compared with the information available for the identified class (a free-form natural language description, a tabular character synopsis, or images, audio tracks, videos, etc.). The use of a wide range of characters (in contrast to those selected during the key phase) helps to avoid errors due to oversight, bad material, misunderstanding of a character, or bad key design. Although probably most biologists have learned to distrust their own initial identification results, this phase is often overlooked in formal treatments of identification.
The relevance of the phases depends on the experience of the person performing the identification. Experienced identifiers for a given taxonomic group may not recognize an orientation phase at all. The key phase may be skipped if a sufficiently small taxonomic group is suspected, making it worthwhile to start directly with a browsing phase. The suspected group may be as small as a single species (the object is “recognized” or “known”, Pankhurst 1993).
Finally, depending on the estimated probability of recognition, even the confirmation phase may be skipped. This classification of the entire identification process offers important insight into how well-designed identification tools (whether printed or computer-aided) should be constructed. Material relating to the individual phases should be clearly and visibly marked, helping users to tailor the identification process to their needs. The material for subsequent phases should be placed close together for fast access. The assumption that identification in popular plant or bird identification books (“field guides”) immediately starts with a browsing phase is somewhat erroneous. On closer inspection the user is usually guided by grouping the material into intuitive categories, often two or three levels deep (Stevenson & al. 2003). The result is equivalent to a short key phase. Even where no explicit “group keys” exist, some form of coarse classification (taxonomic or artificial, e. g., “flower color”) is usually embedded in the design of the book and occasionally a “key” is incorporated into the table of contents. Identifying a “bird” truly by browsing is not very practical; identifying a bird of prey, a dabbling duck, or a garden song bird by browsing is.
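The remark that well-designed keys may come close to a binary search can be made concrete with a small calculation. The following is a minimal sketch, not part of the original deliverable: the function names are hypothetical, and it assumes an idealized key in which every couplet splits the remaining taxa as evenly as possible, compared with a fully unbalanced key that keys out one taxon per couplet.

```python
def min_couplets(n_taxa: int, leads_per_couplet: int = 2) -> int:
    """Couplets on the longest path of a perfectly balanced branching key."""
    if n_taxa < 1 or leads_per_couplet < 2:
        raise ValueError("need at least one taxon and two leads per couplet")
    steps, reachable = 0, 1
    # Each couplet multiplies the number of distinguishable taxa by the
    # number of leads; stop as soon as all taxa can be told apart.
    while reachable < n_taxa:
        reachable *= leads_per_couplet
        steps += 1
    return steps


def comb_key_couplets(n_taxa: int) -> int:
    """Longest path of a fully unbalanced dichotomous key (one taxon per couplet)."""
    if n_taxa < 1:
        raise ValueError("need at least one taxon")
    return n_taxa - 1


if __name__ == "__main__":
    for n in (50, 1000):
        print(n, min_couplets(n), comb_key_couplets(n))
```

Under these assumptions a balanced dichotomous key for 1000 taxa needs at most 10 couplets, whereas the fully unbalanced key may need 999. Real keys lie somewhere in between, because reliable characters rarely split a group exactly in half.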
Structural classification of identification tools
In general, an identification tool is a device to accelerate the process of comparing a given object with all available object or class descriptions. Identification tools are closely related to indices in databases. This section tries to highlight essential differences of identification tools, especially where the difference is relevant to information models. The most common form of identification tool in biology uses a process in which the user compares the material to be identified with a group of logical propositions that evaluate to either true or false. The true proposition is accepted and leads to the next step in the identification process. Essentially, all propositions in a group must be mutually exclusive and exhaustive (i. e., no situation that is not covered by one of the propositions should occur at this point in the key). Propositions may involve categorical as well as quantitative data. In printed keys, but also in many computer-aided identification tools, users compare their measured values against fixed limits (e. g., “≥ 12”) defined in a proposition, instead of entering values directly (leg number, leaf length). Computer-aided keys may also support alternative methods, where quantitative values are directly entered and then algorithmically compared with values in databases. The matching process used to do so may involve fixed error margins or multivariate statistical methods. Propositions are found in two styles: · In the question / answer style each proposition is the combination of a question and one of multiple answers. Although this is the most common style in general questionnaires, it is relatively rare in printed or computer-aided identification tools in biology. Example: How many pairs of legs are present? a) 9-10, b) ≥ 12. · In the lead style, paired propositions are presented, one of which should evaluate to true. The lead style implicitly starts each step with the question “which of the following statements is true?”. 
This style is the most common in conventional, printed identification tools. Example: a) 9-10 leg pairs, antennae short, b) ≥ 12 leg pairs, antennae long.
In the question / answer style the elements of an identification step are kept together by the question. In the lead style, commonly other means (formatting, numbering, etc.) are used. The set of propositions that must be evaluated in a single step is commonly called a “couplet”. Although primarily used for lead style keys, in the following the term is used for both question / answer style and lead style.
The classical key in biology is a dichotomous key, where the sequence of couplets is fixed and each couplet contains exactly two propositions (“leads”). A relatively frequent variant of the dichotomous key supports more than two leads per couplet: the recommended term for such a key is polytomous key. The term “polychotomous” is also found, albeit erroneously formed (dichotomy is based on “dicho-”, Greek “dikho-”, “in two”, “apart”, and “-tomy”, Greek tomos, a “slice”, “cutting”, “section”). Whether a key is dichotomous or polytomous is often irrelevant. In many treatments both types are simply called “key” or “diagnostic key”, terms which, however, also include keys where the sequence of questions is freely selectable by the user (multi-access keys, discussed further below). A specific and agreed term for the generalization of dichotomous and polytomous keys is lacking; the term branching key is chosen here.
The main alternatives to branching keys are keys where the user may freely choose the questions to be answered, essentially telling their own story rather than being examined. As with branching keys, no generally accepted and established term for these keys exists. Many terms have been proposed, and at least one is used confusingly and should be avoided (e.g. synoptic key). Following the review by Edwards & Morse (1995), the term multi-access key is preferred herein.
A multi-access key has several advantages:
· It allows the user to ignore any question considered undesirable. In a branching key the failure to answer the next question (e. g., because of interpretation problems, because the object part is missing, or because the feature is not expressed at the time when the object was collected) compels the user to follow all leads of the couplet. Determining which leads are dead ends and which lead to a successful identification is, in practice, very time-consuming and error-prone.
· It allows the user to select the characters that are most conveniently observed in a given object. Although a branching key usually prefers convenient characters as well, a selection that is appropriate for the majority of taxa in the key may not be convenient for the current taxon. In a multi-access key the user is able to react to this.
· In a computer-aided multi-access key, quantitative characters can be employed directly (entered as measurements), rather than requiring a previous fixed categorization (e. g., into “≤ 2 cm”, “> 2 cm and < 5 cm”, “≥ 5 cm”).
· Experienced users can apply previous knowledge to accelerate the identification progress. Even if they do not know the taxon name, they may often know that a particular character state is rare or even unique in a group. By starting with this, an identification that would require dozens of steps in a branching key may be finished after two or three steps.
The choice between multiple questions at each step is the defining feature of multi-access keys. Although this is not essential, most multi-access keys limit each question (i. e. couplet) to a single character, avoiding characters combined with ‘and’, ‘or’, or ‘not’. Most multi-access keys further make liberal use of more than two alternative answers, usually one lead for every state of a character.
Where too many states would result in a confusingly high number of leads, a lead may combine multiple states with ‘and’, ‘or’, or ‘not’ (either during key construction, or already embedded in the terminology). The preference for multi-character couplets and dichotomy in branching keys and the preference for single character couplets and multiple leads in multi-access keys are interrelated. In branching keys both the desire to create mutually exclusive and exhaustive alternatives, and the need to make the key practical by offering alternatives to characters that are not always observable, often leads to multi-character couplets. However, evaluating three or more complex, multi-character statements quickly becomes a challenge in Boolean logic, leading to a preference for dichotomous keys. In contrast, in multi-access keys, the tolerance of these keys to non-exclusive leads and to missing information on specific characters removes the need for combining characters. On the contrary, combining characters usually makes it more difficult for users to select the next couplet. In the context of single-character couplets, limiting the choices to two leads becomes artificial. The use of single or multiple characters in a key couplet is occasionally called a monothetic or polythetic key. In the Aristotelian sense monothetic or polythetic expresses whether class membership can be identified through an unambiguous combination of characteristics, i. e., whether a single set of necessary and sufficient conditions exists (monothetic) or not (polythetic). In a polythetic classification, members of a class have variable characteristics, and no single set of differential characteristics exists (Radford & al. 1974). A classification based on multiple characters may be either monothetic or polythetic. Polythetic taxon delimitations are indeed one reason why a key may have to use multiple characters, but – as discussed above – multiple other reasons exist. 
And conversely, polythetic taxa may alternatively be identified using single-character couplets, where taxa are keyed out in multiple places. Monothetic or polythetic should therefore not be applied when describing the structure of keys. Identification tools aiming at identification by browsing (especially field guides) may be considered another structural form of a key. However, these tools are usually organized into a sequence of categories, followed by descriptions or illustrations intended for browsing (and arranged taxonomically or alphabetically). The structure is thus identical to a branching key, where the number of leads in the browsing phase is usually much higher than in a “normal” polytomous key.
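The structure of a branching key described in this section can be sketched as a small data structure. This is an illustration under stated assumptions, not the KeyToNature information model: the class names `Lead` and `Couplet`, the helper functions, and the taxon names are all hypothetical, and the key is assumed to be a tree (a reticulated key with redirections would need cycle handling). The first couplet reuses the lead-style example given above.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Lead:
    statement: str               # proposition the user judges true or false
    target: Union[Couplet, str]  # next couplet, or the name of a taxon


@dataclass
class Couplet:
    leads: list[Lead]            # two leads: dichotomous; more: polytomous


def identify(start: Couplet, choose: Callable[[list[str]], int]) -> tuple[str, list[str]]:
    """Follow the fixed couplet sequence; `choose` returns the index of the true lead."""
    node: Union[Couplet, str] = start
    path: list[str] = []
    while isinstance(node, Couplet):
        lead = node.leads[choose([l.statement for l in node.leads])]
        path.append(lead.statement)
        node = lead.target
    return node, path


def is_dichotomous(couplet: Couplet) -> bool:
    """True only if every couplet in the key has exactly two leads."""
    if len(couplet.leads) != 2:
        return False
    return all(is_dichotomous(l.target) for l in couplet.leads
               if isinstance(l.target, Couplet))


def keyed_out(couplet: Couplet) -> Counter:
    """Count how many times each taxon is keyed out."""
    counts: Counter = Counter()
    for lead in couplet.leads:
        if isinstance(lead.target, Couplet):
            counts += keyed_out(lead.target)
        else:
            counts[lead.target] += 1
    return counts


# Hypothetical example key; "Taxon A" is keyed out in two places.
KEY = Couplet([
    Lead("9-10 leg pairs, antennae short", "Taxon A"),
    Lead("≥ 12 leg pairs, antennae long", Couplet([
        Lead("body flattened", "Taxon B"),
        Lead("body cylindrical", "Taxon C"),
        Lead("body spherical", "Taxon A"),
    ])),
])
```

In this sketch the user never chooses which couplet comes next; only the lead within the current couplet is selected, which is the defining constraint of a branching key.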
Most relevant criteria for a structural classification of identification tools
To summarize, the most relevant criteria for a structural classification of keys are:
· whether the couplets must be answered in a fixed sequence defined by the key authors, or whether the sequence is freely selectable by the user (branching versus multi-access key);
· in the case of a branching key: whether each taxon is keyed out only once, whether a taxon may be keyed out in multiple places, or whether it supports redirections back into different branches of the key (“reticulated key”);
· whether the propositions (i. e. leads) in a couplet are limited to two alternatives or not (dichotomous versus polytomous key);
· whether each couplet is limited to a single character, or whether it may be a combination of multiple characters;
· in the case of multi-character propositions: whether Boolean operators such as ‘and’, ‘or’, or ‘not’ may be used;
· whether couplets are a list of complete statements, or split into question and answer parts (this may occur both in branching and multi-access keys);
· whether the key, in addition to descriptive data, supports the selection of available observation conditions, instrumentation or methods.
Both branching and multi-access keys can be presented in various formats or styles, which depend to a large degree on the presentation medium (printed or computer). In a small comparative study, Morse & al. (1996) found a small advantage of multi-access keys over (printed and hyperlinked) branching keys with respect to the accuracy of identifications (not statistically confirmed). At the same time, the use of multi-access keys required substantially more time. The latter result may have been due to the fact that in the study the branching keys included illustrations, whereas these had to be looked up separately in print when using the computer-aided multi-access key. This topic requires further study.
It is likely that the relative strength of branching and multi-access keys strongly depends on the number of taxa in a key, the difficulty in finding a consistent set of reliable and easily observable characters, the variability with which necessary characters can be observed, and the experience of the user. Due to the interactive properties of multi-access keys, these have a higher potential to become faster with increased experience of the user than branching keys. Given that both branching and multi-access keys seem to have advantages, the information model should support both types.
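For comparison with branching keys, the core of a multi-access key can be sketched as elimination over a table of taxon descriptions. This is only a sketch under stated assumptions: the table `TAXA`, its characters and values are invented for illustration, categorical characters are stored as sets of states, quantitative characters as (minimum, maximum) ranges, and a character missing from a description never eliminates that taxon.

```python
from __future__ import annotations

# Hypothetical descriptions: taxon -> character -> set of states or (min, max) range.
TAXA = {
    "Taxon A": {"flower color": {"white", "pink"}, "leaf length (cm)": (1.0, 3.0)},
    "Taxon B": {"flower color": {"yellow"}, "leaf length (cm)": (2.5, 6.0)},
    "Taxon C": {"flower color": {"white"}, "leaf length (cm)": (5.0, 9.0)},
}


def matches(recorded, observed) -> bool:
    """Compare one observation with the recorded states or range of a taxon."""
    if isinstance(recorded, tuple):      # quantitative: entered as a measurement
        low, high = recorded
        return low <= observed <= high
    return observed in recorded          # categorical: one of the recorded states


def remaining(taxa: dict, observations: dict) -> list[str]:
    """Taxa not contradicted by any observation; the order of answers is free."""
    result = []
    for name, description in taxa.items():
        if all(char not in description or matches(description[char], value)
               for char, value in observations.items()):
            result.append(name)
    return result
```

The user may supply observations for any subset of characters, in any order, and a measurement such as 2.8 cm is compared directly with the stored ranges instead of being forced into predefined classes.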
“Promorph” and “looks-like” metaphors
Instead of relying on analytical characters, a special form of “assisted object matching” relies on the human intuition for “similar” patterns or forms. This approach has been termed the “promorph method” by Fortuner (1989, 1993) and the “looks-like method” in the Electronic Field Guide project. Fortuner (1989) defines a promorph as “a form that can be recognized before detailed study of its morphology”. Promorphs, although typically supported by images, may be given names to be able to refer to the concept in written text. In these methods, images (photographs or generalized drawings) are used already at a high level in the key, representing a group of similar species rather than individual species. The user of a key is confronted with a set of images (e. g., of butterfly wing patterns) and chooses the one considered to be most similar to the object. By creating several such “test panels” and by studying human similarity estimates in known test cases, it is possible to restrict the scope of objects returned by a query. This identification model utilizes the unconscious human pattern similarity recognition and is thus a completely different kind of “matching method”. It can be fast, intuitive, and requires minimal or no knowledge of terminology. On the other hand, it is not strictly analytical. Typical users may classify a species under multiple promorphs, and some species may not fit into any larger group of promorph similarity (requiring either the addition of a specialized, ineffective promorph, or some method to communicate that “other” species exist as well). Promorph or “looks like” images typically guide the user to a set of species, narrowing the choices, so that the following steps in the identification (using diagnostic characters or images, including images highlighting diagnostically significant features) become more efficient.
The reliance on subconscious similarity estimates requires extensive testing of human similarity estimates for the objects to be identified. Similarity estimates may be culture-dependent and strongly depend on the expertise of the observer. An expert will include the known diagnostic features subconsciously into the similarity estimate, essentially weighting the general similarity estimates by recognizing parts that an inexperienced observer would probably ignore. This can be compensated by an appropriate choice of features displayed in the promorph image, but it makes the promorph images somewhat dependent on the set of promorphs that is displayed to the user. Looks-like and promorph-guided identification is probably a highly efficient way of identification by humans. However, the required extensive testing of all potential user groups of a key is time consuming and expensive, especially if the number of potential identification results (species richness in the scope) is large. For taxonomic groups with low commercial impact, these costs will often be prohibitive.
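The narrowing step of a promorph-guided identification can be sketched as a simple lookup. This is a hypothetical illustration only: the promorph names, the taxon names and the `candidates` function are invented, and the sketch assumes that each species has already been assigned to zero, one or several promorphs on the basis of tested human similarity estimates.

```python
# Hypothetical assignment of species to promorphs. A species may appear under
# several promorphs, and some species fit none of them.
PROMORPHS = {
    "Taxon A": {"eyespot wing", "banded wing"},
    "Taxon B": {"banded wing"},
    "Taxon C": {"plain wing"},
    "Taxon D": set(),
}


def candidates(promorphs: dict, chosen: str) -> list[str]:
    """Species to carry into the next identification step."""
    if chosen == "other":
        # Species that fit no promorph stay reachable through an explicit "other" choice.
        return sorted(t for t, forms in promorphs.items() if not forms)
    return sorted(t for t, forms in promorphs.items() if chosen in forms)
```

Listing a species under every promorph that users plausibly choose for it trades a longer candidate list for a lower risk of losing the correct species at the first step.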
Other classification criteria for identification tools
In addition to the criteria of content, structural classification, and interactivity discussed so far, some secondary classification criteria are commonly used. A major distinction is often made between field guides and expert keys. Stevenson & al. (2003) show that the perception of what a “field guide” is, is strongly determined by market and publishing constraints, leading to browsing guides showing mostly the entire organism, tailored for a specific area, focusing on large and often colorful organisms that are abundant and easy to study. In practice, the balance between the market interest and the number of organisms in a taxonomic or ecological group and the geographic area corresponding to the market will determine whether a field guide is reasonably complete (common for birds, mammals, amphibians, dragonflies, butterflies, trees, etc.) or whether the identification quality is compromised by ignoring a large proportion of the less frequently occurring or less showy taxa (common, e. g., for most plant, fungi, or insect groups). With increased availability of digital software and hardware, the publishing constraints can be lessened, allowing for unlimited support of color photographs even for taxa with a smaller market impact. Computer-aided identification tools allow for better integration of elimination approaches (e. g., key-based) with browsing approaches, and for improving the relation between the browsing and the confirmation phase (by reusing the same material, but tailoring the amount of detail shown to the specific phase). Thus the term “field guide” should best be used in its original sense, as a guide optimized for quick identification of organisms immediately during observation or collection. It is always desirable to reduce the amount of technical language used to a minimum, but whether the lowest level of expertise currently achievable for a taxonomic group makes a field guide easily usable by the general public or not should not determine its status.
Similarly, the relative amount of analytical keys (allowing a process of elimination) and browsable illustrations should not be an a-priori condition of “field guides”, but determined by the number of taxa required to be distinguished. Picture browsing works best with perhaps up to 50 taxa in a group determined by means of elimination. Other classification questions that may be relevant for both printed and computer-aided keys are: · Is the key the result of a design process (authored key, containing information possibly not available elsewhere) or is it algorithmically created based on available data? Both branching and multi-access keys may be authored or algorithmically created. The sequence and selection of characters in a branching key may contain distilled experience of generations of researchers. · Which combination of text and media resources (especially photos and drawings, called “multimedia taxonomic keys” in Morris & al. 2007) is used in the key? A key may consist entirely of images, entirely of text, of images with caption text, or of text with illustrations. · If text and media resources are combined, the latter may be directly integrated into the key structure (in-place), or linked through reference numbers or hyperlinks. In the latter case, the resources may be available on the same page or screen (in-view) or elsewhere (look-up, in computer-aided keys especially in the form of pop-up windows). · Do media resources represent the entire organism or are they analytical and specific to identification details relevant to decisions in the key? The latter criterion may be applied to photos and drawings (which may be appropriately cropped, or display area-of-interest boxes or arrows), but also to sound, video, or 3-dimensional voxel pictures.
Some further criteria are only relevant for computer-aided keys:
· Handling of quantitative data: a) Is it possible to directly enter quantitative values and compare these algorithmically with values in databases? b) Are error-tolerant comparison methods supported or simple value comparisons? c) Are multivariate statistical methods supported?
· Error tolerance: a) Is the identification error-tolerant, i. e.: is the key able to suggest taxa that are close but inexact matches? b) Are contradictions silently accepted or is the user informed which data are in contradiction with the result?
· Guidance in character selection (multi-access keys only): a) Is character guidance authored or algorithmically calculated based on coded descriptive data? b) Can the list of characters be sorted such that recommended characters appear first? c) Does the character recommendation algorithm work for quantitative and categorical, or only for categorical characters? d) Are redundant characters (those that no longer contribute to the identification progress) marked in some way? Are they removed from the list of available characters and thus no longer available to confirm or contradict other data in the identification process? e) Does the guidance adapt to identification progress, i. e., is it based on the remaining taxa, or is it always based on the set of all taxa in the key? f) Are character applicability rules (“character dependency”) observed? Are inapplicable characters marked or completely removed? Are controlling characters implicitly scored if a dependent character is scored?
· History: a) Is a history of “identification steps” or “information entered so far” available? b) How is this arranged (sequence of scoring, alphabetical, by concept, by part, etc.)? c) Can the user choose between different arrangements? d) Is it possible to revert (delete) or update (change) a previous identification step?
· Is the key adaptive, i. e., does it change its structure based on previous information?
Digitized branching keys might be simply hyperlinked, or they might hide / fold parts of the key as they become irrelevant. However, a branching key that is split into one web page per couplet would be indistinguishable from a system that is adaptive by other means. · “Granularity of interaction”: a) Is it possible to receive intermediate results (list of taxa remaining, number of taxa remaining)? Is this feed-back occurring automatically after each user action, or does it have to be explicitly requested? b) Is it possible to enter multiple observations or answer multiple questions before the next time-consuming interaction occurs? In a local application, such a time-consuming step may be the evaluation of best recommended characters, or the calculation of the list of remaining taxa. In a web-based application it may be relevant whether it is possible to perform several such steps (e. g., answer multiple couplets, or enter several quantitative values) before sending answers back to the server. · Are methods for a final “browse-identification” provided (i. e., if the identification is incomplete and terminates with a set of taxa rather than a single taxon)? What kind of information is provided during this phase? · Is it possible to switch between different identification methods? Are identification progress data transferred – at least in part – from one method to another? · Are higher-order identification tools integrated? Examples might be a color picker to input the color of an object part, an algorithmic shape picker, image or sound analysis of imported media files, or interfaces to automatic data collection routines (e. g., chromatographic data).
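The error-tolerance criterion listed above can be made concrete with a small sketch. The Python example below is only an illustration, not the method of any tool in the consortium: the taxa, the characters and the function name `rank_taxa` are hypothetical. It keeps taxa that contradict the observations in at most a given number of characters, and reports which characters are in contradiction instead of silently accepting them.

```python
from typing import Dict, List, Tuple

# Hypothetical coded descriptions: taxon -> {character: state}.
TAXA: Dict[str, Dict[str, str]] = {
    "Taxon A": {"habit": "tree", "leaf_margin": "entire", "fruit": "fleshy"},
    "Taxon B": {"habit": "tree", "leaf_margin": "toothed", "fruit": "fleshy"},
    "Taxon C": {"habit": "herb", "leaf_margin": "entire", "fruit": "dry"},
}


def rank_taxa(
    observations: Dict[str, str],
    taxa: Dict[str, Dict[str, str]] = TAXA,
    max_mismatches: int = 1,
) -> List[Tuple[str, List[str]]]:
    """Return (taxon, contradicted characters) pairs, best match first.

    A taxon is kept if it contradicts the observations in at most
    `max_mismatches` characters; with max_mismatches=0 the comparison
    is a strict, non-tolerant one.
    """
    results = []
    for name, description in taxa.items():
        # Characters scored for the taxon whose state differs from the observation.
        contradictions = sorted(
            character
            for character, state in observations.items()
            if character in description and description[character] != state
        )
        if len(contradictions) <= max_mismatches:
            results.append((name, contradictions))
    # Fewest contradictions first; ties broken alphabetically.
    results.sort(key=lambda item: (len(item[1]), item[0]))
    return results


observed = {"habit": "tree", "leaf_margin": "entire", "fruit": "fleshy"}
for taxon, conflicts in rank_taxa(observed):
    print(taxon, "contradicted by:", conflicts or "nothing")
```

With one tolerated mismatch, "Taxon B" is still suggested as a close but inexact match, and the user can see that the leaf margin is the contradicting character.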
MAKING THE IDENTIFICATION TOOLS OF KEYTONATURE SEARCHABLE
By 20 August 2008, KeyToNature had gathered more than 1200 identification tools, which are now searchable on-line in the KeyToNature portal (www.keytonature.eu, Fig. 1) as a result of this deliverable. They encompass a broad range of formal types and differ widely in their features, several of which are not considered in a formal classification scheme (e.g. language, groups of organisms, portability). The search engine (Fig. 2) includes a simple search on some of the most important fields, such as types of organisms (Fig. 3) and language (Fig. 4), and an advanced search interface (Fig. 5) which permits the user to “filter” the tools on the basis of several characters. Importantly, the search result is not limited to metadata (Fig. 6): it gives direct access to the tool itself in the case of on-line tools (Fig. 7), or to an illustrated and commented page in the case of tools with restricted access (e.g. CD-ROMs, Fig. 8). Among the many characters that could be used for the final classification, we have selected a few which we consider particularly important for the aims of the project. The search fields are briefly commented on below.
1) Titles & Keywords
This is a rather fuzzy but useful search field. It searches the titles of the identification tools, plus the keywords and geographic areas specified by the various data providers. It can be used to search for titles (e.g. epifüütsed), groups of organisms (e.g. insects, plankton), geographic areas (e.g. Iberian Peninsula), etc.
2) Group of organisms
The available metadata do not allow, for the moment, an in-depth search on the taxonomic groups included in the various identification tools. Thus, only a few major groups have been made searchable. These are: 1) Animals: vertebrates (21 tools), 2) Animals: invertebrates (58 tools), 3) Fungi: lichens (96 tools), 4) Fungi: non-lichenised microfungi (162 tools), 5) Fungi: non-lichenised macrofungi (52 tools), 6) Plants: algae (8 tools), 7) Plants: mosses (9 tools), 8) Plants: vascular plants (820 tools). Users looking for something more specific (e.g. ‘insects’ or ‘fishes’) can go back and use the keywords. This search field could be considerably improved by adding a more structured taxonomy as metadata, or even the complete list of taxa included in each identification tool. The latter option would permit the user to search for all tools that include a certain species. An example of its implementation is available in the Italian country page, for all tools produced by UNITS for vascular plants (see www.dryades.eu): here a user (e.g. a teacher) who wants to select a tool for identifying a given plant (e.g. a common weed like Plantago major) can rapidly find a list of all tools that include it, with information on the educational levels for which they are best suited (Fig. 9).
3) Language
This is, of course, one of the main non-structural characters of an identification tool which a teacher will look for. At the moment we have tools in 12 different languages, including a few experimental keys written in non-official minority languages such as Sardinian or Friulian, but with a very uneven coverage (the predominant languages are English, Italian and Spanish). The number of available languages and their share of the total are changing rapidly due to progress in other Workpackages of KeyToNature.
4) Geographic area
In the first version of the search engine this important field was omitted, because the available data were poorly structured and rather fuzzy. Under pressure from both KeyToNature partners and Focus Groups of teachers, we decided at the last moment to include it anyway. This required the modification and partial structuring of some metadata. Two examples: 1) “global” and “World” were subsumed under the latter heading; 2) countries were arranged by continent (Estonia = Europe-Estonia etc.). The result is the possibility of searching through ca. 40 entries. The general response from Partners and Focus Groups was: “much better than nothing!”, but we would like to work further on this field. It should be stressed that geographical coverage has different degrees of importance depending on the group of organisms. For example, lichens are widespread across wide areas, so that a lichen key produced for Germany can also be useful in Holland or even in N Italy, whereas reptiles or vascular plants are much more restricted in distribution.
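The two restructuring rules mentioned above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for the KeyToNature database: the lookup table `CONTINENT_OF` is a hypothetical three-entry stand-in for the real list of ca. 40 entries, and the function name is invented for the example.

```python
# Hypothetical lookup table; only a few entries are shown for illustration.
CONTINENT_OF = {"Estonia": "Europe", "Italy": "Europe", "Spain": "Europe"}

# Values that are subsumed under the single heading "World".
WORLD_SYNONYMS = {"global", "world"}


def normalise_area(raw: str) -> str:
    """Map a free-text geographic value onto a structured search entry."""
    value = raw.strip()
    if value.lower() in WORLD_SYNONYMS:
        return "World"
    continent = CONTINENT_OF.get(value)
    if continent is not None:
        # Countries are arranged by continent, e.g. "Europe-Estonia".
        return f"{continent}-{value}"
    # Unrecognised values are left untouched so they can be reviewed by hand.
    return value


for raw in ["global", "Estonia", "Iberian Peninsula"]:
    print(raw, "->", normalise_area(raw))
```

Leaving unrecognised values unchanged, rather than guessing a continent, keeps the fuzzy entries visible for later manual structuring.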
5) Number of taxa
One might wonder why the number of taxa included in each identification tool has been made searchable as an important feature. The reason is that the number of taxa is generally an indicator of how ‘easy’ an identification tool is: the lower the number, the easier the tool. For example, primary school teachers looking for a key to vascular plants should select one which includes fewer than 100 species rather than one with more than 1000. This, however, is true only within a given group of organisms. It may work for vascular plants, vertebrates etc., but not for groups such as algae or microfungi, where, even with a low number of species, one needs microscopes or cultures on agar to use the key. In such cases the number of taxa alone is not enough for selecting a tool which fits a certain educational level; the instruments and methods required to use the key are more important. For the moment this search field may serve as a surrogate for the astute teacher interested in certain groups of organisms, but in the future it should be replaced by a very different one: educational level.
6) Key Structure
We distinguish 5 main types:
a) Dichotomous = fixed branching key limited to two leads. Interfaces can vary enormously, from simple texts (paper-printable, usually pdf files) to identification tools in which every single character is illustrated by pictures or drawings (difficult to print on paper, and mostly available via electronic media). A disadvantage of dichotomous keys is that the user is forced to choose at each step between two characters (or two clusters of characters): if a character cannot be observed, the identification pathway is blocked. To partly overcome this problem, all of the FRIDA-generated keys permit the user to print the illustrated dichotomous skeleton of the key to the remaining species at any stage of the identification process. In this way the user can try to get past a difficult step by browsing through the pictorial key containing the species which follow that step. An advantage of dichotomous keys is that, contrary to all other types of tools listed below, they can be made independent of the server and software that generated them, i.e. as a simple series of connected html pages. In this way, they can easily be made portable on pocket PCs and smartphones, even without a connection to the web (see below).
b) Polytomous = fixed branching key with more than 2 leads. This type of key is suitable for pictorial interfaces, where every page shows a gallery depicting the different character states. In some cases, such as the key to the birds of Europe by ETI (see http://www.eti.uva.nl/products/catalogue/cd_detail.php?id=230&referrer=search), the illustrations completely replace the text (e.g. the silhouettes of birds among which the user has to make a choice).
c) Multi-entry = fixed non-branching key in which several character states can be used at the same time. Contrary to the dichotomous query interface, which forces the user to select, at each step, between 2 characters or clusters of characters, the multi-entry query interface permits the user to specify several characters in a single step. The combination of selected characters acts as a filter on the database, and the result is a reduced list of all species which share the combination of characters selected by the user. The main problem of this interface is that it does not always lead to a single species. The keys produced by FRIDA (UNITS) overcome this problem by automatically invoking the illustrated dichotomous key, limited to the species selected by the query. For example, if the user has selected “tree”, “leaves entire”, “fruit fleshy”, the result will be a list of all trees with entire leaves and fleshy fruits (the names being linked to pictures), plus the dichotomous key to those species only. Another possibility, for more expert users, is to select a family or a genus, obtaining the dichotomous key to all species of that family or genus which are present in the key.
d) Multi-access = the sequence of characters or leads can be freely chosen by the user at each step. This is practically a dichotomous key in which, however, the user can freely select among several dichotomies at each step. Compared with a classical dichotomous key, this interface has the great advantage of not forcing the user to choose between two options referring to characters which he or she does not understand or cannot see on the material at hand. For this reason, this interface is widely used worldwide. However, there are disadvantages as well, the main one being that the characters are usually “atomized”, i.e. only a single character can be selected at each step, while well-edited classical dichotomous keys often permit a decision based on combinations of different characters (which are often the best way to distinguish among species of critical groups). Adaptation of language, editing and portability are also reduced compared with classical dichotomous keys. As far as the interfaces are concerned, multi-access keys, while not printable on paper, also allow for the introduction of pictorial aids to characters.
e) Browsing = descriptions or images arranged in a long sequence. This is the typical approach followed by field guides, where the user tries to identify a species by browsing through a more or less well-organised series of pictures, sometimes accompanied by short descriptions. It is a rather fuzzy approach to identification, which is usually disdained by academics, but it is the preferred way for many amateurs.
7) Usability – portability
It makes a big difference whether one wants to use an identification tool at home or in the office (using standard computers), or in the field (with a pocket computer or a smartphone). Portable identification tools can be split into two groups: a) those which require a connection to the internet, b) those which stand alone, i.e. which can be downloaded from the web and stored on the memory card of a portable device. Many portable versions have been developed in KeyToNature (both stand-alone and on-line). The stand-alone versions are very similar to the on-line dichotomous keys, and are necessarily limited to this type of query (dichotomous keys can have a life which is independent of the program that created them, e.g. as a series of interconnected html pages). Compared with the on-line dichotomous keys, the stand-alone versions have some limitations in the number of pictures they contain and in other minor features. The on-line versions for mobiles have exactly the same features as the versions for standard computers, e.g. the FRIDA keys also include the multi-criterion query interface and the possibility of seeing all pictures available for each species, the only difference being the layout, which has been automatically adapted to the reduced screen size of pocket PCs and smartphones. The main disadvantage of the on-line versions for portables is the poor availability of internet connections in wild areas. By contrast, the stand-alone versions, although less powerful, have the great advantage that they can be used anywhere. Incidentally, all of the pdf files (e.g. Flora Iberica) can be printed on paper, and hence are ‘portable’ as well.
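The filtering behaviour of the multi-entry query interface described under Key Structure (type c) can be sketched as follows. The example is a simplified Python illustration, not FRIDA's implementation: the species names, character states and the function `multi_entry_query` are hypothetical, and each species is reduced to a plain set of character states.

```python
from typing import Dict, List, Set

# Hypothetical species descriptions, each reduced to a set of character states.
SPECIES: Dict[str, Set[str]] = {
    "Species 1": {"tree", "leaves entire", "fruit fleshy"},
    "Species 2": {"tree", "leaves entire", "fruit dry"},
    "Species 3": {"shrub", "leaves entire", "fruit fleshy"},
    "Species 4": {"tree", "leaves lobed", "fruit fleshy"},
    "Species 5": {"tree", "leaves entire", "fruit fleshy", "evergreen"},
}


def multi_entry_query(
    selected: Set[str], species: Dict[str, Set[str]] = SPECIES
) -> List[str]:
    """Return all species that show every selected character state."""
    # The set of selected states must be a subset of the species' states.
    return sorted(name for name, states in species.items() if selected <= states)


remaining = multi_entry_query({"tree", "leaves entire", "fruit fleshy"})
print(remaining)  # more than one species may remain
```

In this sketch the query leaves two species, which illustrates why such an interface does not always lead to a single species and why a dichotomous key restricted to the remaining species is a useful follow-up step.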
Our identification tools are either accessible on-line or exist only as physical objects which have to be obtained on request (e.g. CD-ROMs). Most of the on-line resources are freely available on the Web, but some of them, for different reasons, have restricted access. Some exist only as CD- or DVD-ROMs, and can be purchased or obtained for free from the respective data providers; others are available on-line, but only with a username and password supplied on request by the data provider.
9) Interactivity
This feature refers to the degree of interactivity of the identification tools with users. Two main groups are distinguished:
a) Static = no dynamic change of key/display except for simple hyperlink jumps. In our case, static tools are pdf files with descriptions, dichotomous keys, notes and often drawings. Most of them derive from the Flora Iberica Project.
b) Dynamic = with dynamic changes of key/display at every input from the user. Most of our tools belong to the dynamic type.
Static tools are often considered outdated. However, KeyToNature has explored ways of incorporating static tools into a dynamic tool. An example is the key produced by UNITS and CSIC for the Jardin Botanico of Madrid (http://dbiodbs.units.it/carso/chiavi_pub21?sc=165): the basic outline of the key (by FRIDA) is dynamic, but the taxon pages contain a link to the static resources of the Flora Iberica project. This gives added value to both projects, and suggests that gathering more static resources could be fruitful for KeyToNature.
10) Host Application
This is the software necessary for using the tool. This software can be part of the identification tool itself, or the tool can be used with an application commonly available on a computer. Different applications require different skills. Currently, students of any level are able to use common applications such as web browsers. Most of the identification tools of the consortium require only a web browser or a pdf reader, which are usually available on any computer and mobile device.
Web browser = compatible with typical web browsers like Internet Explorer vers. 5-6, Firefox 1-3, Netscape 7 or higher.
PDF Reader = any software capable of displaying PDF files.
Acrobat Reader = a specific software, if compatibility is limited to this.
Lucid Player = CBIT Lucid software (in different versions).
Linnaeus II player = ETI’s Linnaeus software (the player).
Intkey = CSIRO’s Intkey software.
Custom = this value is used if the identification tool is uniquely coupled with custom-programmed software that has no independent name.
11) Data provider
This field allows the user to select all the identification tools of a given data provider.
PROBLEMS AND PERSPECTIVES
As already mentioned in the introduction, the work for this deliverable was carried out in close collaboration with WP4, and the final result anticipates by 18 months an important part of the searchable database which was due in month 30 as a deliverable of WP4 (D.4.3). The latter WP will take care of the further implementation of the database until the end of the project. A main problem is how to guarantee the continuous updating of the database. Several solutions are possible and technically rather easy; we are currently investigating and testing a method in which the uploading of data is a manual process distributed among all partners, but in which the upload system already reports certain quality problems. The central harvesting and integration of these data is then planned to be a fully automated process. Rather than writing custom software for a repository, we plan to use a MediaWiki installation. For further details we refer to D.4.2.
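The idea of an upload step that already reports certain quality problems can be illustrated with a short Python sketch. It does not describe the MediaWiki-based system referred to in D.4.2: the field names, the list of required fields and the function `quality_problems` are assumptions made for the example, loosely modelled on the search fields described in this document.

```python
# Hypothetical field names, loosely following the search fields described above.
REQUIRED_FIELDS = ("title", "language", "organism_group", "key_structure", "data_provider")

# The five key structures distinguished in this document.
KEY_STRUCTURES = {"dichotomous", "polytomous", "multi-entry", "multi-access", "browsing"}


def quality_problems(record: dict) -> list:
    """Return human-readable problems found in one metadata record."""
    problems = []
    for field in REQUIRED_FIELDS:
        # A field counts as missing if it is absent or contains only whitespace.
        if not str(record.get(field, "")).strip():
            problems.append(f"missing field: {field}")
    structure = str(record.get("key_structure", "")).strip().lower()
    if structure and structure not in KEY_STRUCTURES:
        problems.append(f"unknown key structure: {structure}")
    taxa = record.get("number_of_taxa")
    if taxa is not None and (not isinstance(taxa, int) or taxa <= 0):
        problems.append("number_of_taxa must be a positive integer")
    return problems


record = {
    "title": "Key to lichens",
    "language": "",
    "organism_group": "Fungi: Lichens",
    "key_structure": "Tabular",
    "data_provider": "UNITS",
    "number_of_taxa": 0,
}
for problem in quality_problems(record):
    print(problem)
```

Reporting all problems at once, rather than stopping at the first, lets a partner correct a record in a single pass before the automated harvesting step.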