Key to Nature EU deliverable D 4.4: Resource Metadata Exchange Agreement

This version is equivalent to the OpenOffice/Word/PDF version presented as the Key to Nature EU Project Deliverable D 4.4. Please make only minor error corrections here. The version of this agreement for ongoing changes and improvements may be found under Resource Metadata Exchange Agreement.

Key to Nature
Standard agreement for metadata exchange
Deliverable number D 4.4
Status Prefinal
Author Gregor Hagedorn (JKI – formerly BBA)
Contributors (in alphabetical sequence) Marina Ferrer Canal (CSIC); Gideon Gijswijt (ETI); Stefano Martellos (UNITS); Pier Luigi Nimis (UNITS); Bob Press (NHM); Tiina Randlane (Univ. Tartu); Andres Saag (Univ. Tartu); Tomi Trilar (PMSL), Gisela Weber (JKI)
Dissemination level Public
Delivery date 31. August 2008

This project is funded under the eContentplus programme (1), a multiannual Community programme to make digital content in Europe more accessible, usable and exploitable. Contract no. ECP-2006-EDU-410019
1 OJ L 79, 24.3.2005, p. 1.


One aim of the Key to Nature project is making primary data (identification tools) and secondary data (images, sounds, videos, taxon pages) better searchable, accessible, usable and re-usable. This is to be achieved by a search tool which will make widely distributed digital objects online searchable using a single query interface. Rather than merging data in a central repository, only metadata are integrated according to structural and content standards agreed upon in this metadata exchange agreement.

In a slight deviation from the original plan, the agreement covers both the exchange of metadata about identification keys (WP3) and digital media ("secondary data", WP4).

The metadata exchange agreement defines three types of resources: a) institution or organization (“Provider”), b) resource collections, and c) individual resources like “IdentificationTool”, “StillImage”, “TaxonPage”, etc. The metadata for each resource include information about, e. g., title, keywords, taxa, geographic location, copyright, license, format, access type (printed, offline, online, free, login) and URIs. The agreement supports resources available at different quality levels (e. g., high-resolution, web-optimized, or different thumbnail sizes) under different URIs, which is typical for images or sounds. Information in multiple languages (e. g., a title in English as well as Slovenian) is supported. For each resource type, the agreement states which metadata items are required, urgently requested, or optional.

The metadata structure employed for data exchange so far is a flat table structure in which all data can be held in a single table and exchanged through appropriate document-based mechanisms (e. g., tabulator-delimited Unicode text files or Microsoft Access databases). However, a new method that uses the same metadata fields but a different, wiki-based exchange format is currently being tested; it is outlined towards the end of this document. We hope that this method will overcome problems in workflow, quality control, distribution of work, and long-term maintainability, while preserving the chance of integrating data providers without advanced IT capabilities. For high-tech data providers, installing XML-based data exchange through custom web services or standard OAI-PMH methods would be another alternative.

To allow data to be shared and published safely, this document includes agreements on metadata sharing, on a license to create small preview versions, and on a backup or central storage service. Data providers are requested to sign these agreements.



The purpose of this agreement is to share metadata about identification keys available in digital form and resources that are relevant to the creation and enhancement of such identification keys. The following resource types are considered in this agreement:

  • Identification tools (= identification data plus applications where necessary), ranging from multi-access keys and interactive branching keys to static branching keys (which may be available only as PDF). These resources may be valuable both as (a) primary data forming the core of an eLearning package and (b) as resources used to link to when an identification result is reached (e. g., genus keys may be the results of a family key).
  • Media files (still-images, audio, video) are valuable to illustrate either definition of terms (characters/character states) or their expression in specific taxa.
  • Complex documents (PDF, web pages) describing species or other taxa (“taxon pages”, “species pages”) are valuable as primary or secondary (“see also”) results of other keys.

The purpose of the data sharing is primarily to share information about the resources, not directly to share the resources themselves. In general, the project does not plan to store or provide resources from project servers. As soon as resource sharing is intended (e. g., a common identification tool repository, tiny thumbnails of images, reuse of images in an eLearning package) this requires separate agreements from the main metadata agreement.

Data exchange using a flat table structure

The metadata structure proposed here is relatively flat and all data can – if desired – be exchanged in a single table. However, within this table, three types of resources are used to simplify data exchange and avoid unnecessary repetition:

  1. A few metadata items are requested for your entire institution or organization to help the planned system to attribute (or “brand”) information as coming from and belonging to your institution or organization. Note that most metadata elements are not applicable (“–“) for this resource type.
  2. Resources may be grouped into collections. These collections may reflect existing management procedures, different sources or authorship, etc. at your institution. Please view them as an opportunity to make the relation between resources better visible to the future user. By allowing a significant amount of information to be given on the collection level, providing the information for individual resources may be less laborious. Where you find collections undesirable or a burden rather than an opportunity (e. g. in the case of your identification keys), simply create a single “catch-all” collection like “identification keys from x.y.” We do need at least one collection for each provider.
  3. The central items are the metadata for each individual resource. Whereas the information on provider and collections supplies search results with a context, the metadata of the individual resources allow users to find appropriate resources.

The packaging of data into file formats for the purpose of the exchange is described further below under “Data packaging (file or upload formats)”.
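As a rough illustration of how the three resource levels can share one flat table, the following Python sketch reads a tabulator-delimited text export and sorts its rows by their Type value. The sample rows and the helper name `split_by_level` are invented for illustration; only the field names Type, Title and CollectionID and the fixed values “Provider” and “Collection” are taken from this agreement.

```python
import csv
import io

# Hypothetical sample data: one provider row, one collection row, one resource row.
SAMPLE = (
    "Type\tTitle\tCollectionID\n"
    "Provider\tExample Museum\t\n"
    "Collection\tIdentification keys from Example Museum\t1\n"
    "StillImage\tQuercus robur, leaf\t1\n"
)

def split_by_level(text):
    """Group the rows of a tab-delimited table into provider, collection and resource rows."""
    levels = {"provider": [], "collections": [], "resources": []}
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        if row["Type"] == "Provider":
            levels["provider"].append(row)
        elif row["Type"] == "Collection":
            levels["collections"].append(row)
        else:
            # Any other Type value (StillImage, Sound, ...) is an individual resource.
            levels["resources"].append(row)
    return levels

levels = split_by_level(SAMPLE)
print(len(levels["provider"]), len(levels["collections"]), len(levels["resources"]))
```

Keeping all three levels in one table means a provider without database tooling can maintain the whole export in a single spreadsheet; the split into levels only happens on import.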

Metadata agreement

By sharing metadata like web-address (= “URL”), titles and captions, keywords, taxon names, authorship, copyright and license, access to these resources is improved. Although ideally resources would be available directly on the internet (i. e. have a URL), this is not a requirement. Consortium partners will also benefit from information that helps to locate resources or resource versions (e. g., high quality images or sounds) that are not available directly, but may be available after direct negotiations.

The agreement tries to be permissive with respect to required data. Instead of “required” we use the term “urgently requested” (code: “R”) to indicate the request to make an effort to try to obtain this information. Other information is usually optional (code: “o”), some may be not applicable to some resource types (code: “–”). To account for the different resource types, the metadata concepts are annotated in the first columns of the following table using these codes.

In some cases it may be difficult to provide requested information. For example, although it is very important to have copyright and licensing information, it may not be feasible to research them for large resource collections. In such a case it is permissible to use “neutral” statements like “copyright by owner” or “Licenses will be individually negotiated”. The Spanish partner, for example, at the moment uses: "Licenses (except for use within the context of must be individually negotiated" to avoid making commitments now. Another form that is essentially just as neutral, but offers may be “Reuse under a cc license will be considered after individual requests”.

Please consider using a Creative Commons license to help maintain the traditional sharing of scientific information in an increasingly legalistic world. A common form is the cc by attribution – non-commercial – share-alike license, meaning that the work may be modified and included in other non-commercial works, provided that the license is maintained (i. e. the new work is available to you under the same conditions) and that the source is cited and attributed.

The following table lists a number of metadata field names (“elements”) to agree upon. This is a first attempt to reach an agreement. As outlined below under “Extensibility”, you may add further fields.

The first (colored) columns explain the expected applicability of fields to different resource types. Please get back to us if the expectations expressed here do not match your situation. Urgently requested fields are marked “R”, optional fields “o”, and fields considered not applicable “–”. On the collection level, fields that define a default value for the resources contained in the collection (rather than applying to the collection itself) are suffixed with “M” for Master-field.
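The Master-field idea (a collection value acting as default for the resources it contains) can be sketched as follows. This is a minimal illustration, not a description of the project's repository: the function name `apply_collection_defaults` and the choice of CopyrightStatement and License as inheritable fields are assumptions made for the example, based on the statement in the field list that collection-level values for these two fields apply to all contained objects unless the object has its own statement.

```python
# Fields assumed (for this sketch only) to be inheritable from the collection record.
INHERITABLE = ("CopyrightStatement", "License")

def apply_collection_defaults(resource, collections):
    """Return a copy of `resource` with empty inheritable fields filled from its collection."""
    merged = dict(resource)
    parent = collections.get(resource.get("CollectionID", ""), {})
    for field in INHERITABLE:
        # A value on the resource itself always wins over the collection default.
        if not merged.get(field) and parent.get(field):
            merged[field] = parent[field]
    return merged

# Hypothetical records; collections are indexed by their CollectionID.
collections = {
    "1": {
        "Type": "Collection",
        "CollectionID": "1",
        "CopyrightStatement": "© XY Museum 2008",
        "License": "Available under Creative Commons by-nc-sa 2.5 license",
    }
}
image = {
    "Type": "StillImage",
    "CollectionID": "1",
    "License": "Individual licenses available for purchase",
}
result = apply_collection_defaults(image, collections)
print(result["CopyrightStatement"])
print(result["License"])
```

In this example the image keeps its own License statement and only inherits the CopyrightStatement, which it did not supply itself.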

The proposal is based on DublinCore, the IPTC/Adobe XMP standard, experiences with previous projects, and specific discussions at the Key2Nature Kick-off meeting in Trieste. Please review and criticize it. For those acquainted with the DublinCore standard, related DublinCore elements are annotated as subheadings (dc:type, etc., with gray background). These are only for information and can be safely ignored if your data are not mapped to DublinCore.

Tabular list of metadata fields

The leading columns of the table give the applicability of each metadata field, in the following order:

  1. Applicability for the data-providing institution, organization, or individual
  2. Applicability for each collection of identification tools, images, or other resources
  3. Applicability for identification tools or eLearning packages
  4. Applicability for taxon pages, glossary pages, maps, or data sets
  5. Applicability for media resources (still image/audio/video/etc.)
Metadata fields ("elements"):
Field derived from dc:type:
Type Provider (fixed value for provider metadata), Collection (fixed value for collections), StillImage, Sound, MovingImage, Map, etc., see accompanying value list further below in this document.
Fields derived from dc:title:
Title Concise title, name, or label of institution, resource collection, or individual resource. This field should include the complete title with all the subtitles, if any. This field will be the primary basis on which users will select and recognize resources. If you have no “real” title – as frequently occurring in images – please try to generate one. Often the taxon name(s) will form a good substitute title; or the file name itself may contain title-like information.
Logo The URL of icon or logo image to appear in source attribution. Entering this URL into a browser should only result in the icon (not in a webpage including the icon).
Homepage URL of page to which source attribution (title or icon) will link.
Fields derived from dc:description:
Description Description of collection or individual resource, containing the Who, What, When, Where and Why as free-form text. It can optionally present detailed information and will in most cases be shown together with the resource title. If both description and caption (see below) are present, a description is typically displayed instead of the resource.
Caption As alternative or in addition to description, a caption is free-form text to be displayed together with (rather than instead of) a resource that is suitable for captions (especially images). Often only one of description or caption is present; choose the concept most appropriate for your metadata.
Fields derived from dc:identifier:
CollectionID An arbitrary code that is unique among a provider's metadata records of type “Collection” and by which the media resources are linked to their collection. Each image, sound or taxon page should belong to a collection. Examples: "1", "BrdSng", "328423".
ProviderManagedID A free-form identifier (a simple number, an alphanumeric code, a URL, etc.) that is unique and meaningful primarily for the data provider. Ideally, this would be a globally unique identifier (GUID), but the provider is encouraged to supply any form of identifier that simplifies communications on resources within the project and helps to locate individual data items in the provider's data repositories. It is the provider's decision whether to expose this value or not.
Fields derived from dc:rights:
CopyrightStatement A full-text, readable copyright statement, as required by the national legislation of the copyright holder. On collections, this applies to all contained objects, unless the object itself has a different statement. Examples: “Copyright XY 2008, all rights reserved”, “© XY Museum 2008”. Do not place just the name of the copyright holder here!
License The license statement defining how resources may be used. Example: "Available under Creative Commons by-nc-sa 2.5 license". Information on a collection applies to all contained objects unless the object has a different statement. This field also indicates the commercial availability of items. Buying an identification tool or media resource is essentially the purchase of an individual license. Examples of such license statements: “Available through bookstores” for a commercially published CD; “Individual licenses available for purchase” for a high-resolution image (note that the medium or low resolution levels of the same image may be available under Creative Commons!).
CopyrightOwner The holder or owner of the copyright. (Note: ALA uses dc:publisher for this purpose, but it seems doubtful that the publisher is by necessity the copyright owner.)
Fields derived from dc:creator:
Creator Creator(s) of resource (for images: the photographer, not the digitizer). Ideally just the name(s), but it may also contain a more elaborate credit text. Use semicolons to separate multiple names. Avoid using commas: do not invert names into “Lastname, given name”, and use parentheses for localities: “Williams (NHM)” instead of “Williams, NHM”.
MetadataCreator Creator(s) or editor(s) of title, description, keywords, etc. Use semicolons to separate multiple names. This should not be a person simply typing existing content, or converting digital formats (but such a person may be added if substantial editing changes were necessary).
Fields derived from dc:language:
Language Language(s) of resource itself. One of: "zxx" for language-neutral images/nature sounds; a semicolon-separated list of language codes (e. g., "en; it") if the resource is specific to one or several languages; or "und" for resources specific to an unknown/undefined language.
MetadataLanguage Language of description and other metadata (but not necessarily of the image itself). The metadata language should normally be a single language code, not a list! Please try to structure your data accordingly.
Fields derived from dc:date:
TemporalCoverageStart The single date (“creation date”) or start of time period at which the media resource was originally created and to which it applies. In the case of non-digital images/sounds this should be left empty if only a digitization date is known (see below). Use the international (xml) format yyyy-mm-ddThh:mm (e. g. "2007-12-31" or "2007-12-31T14:59"). Where desirable, timezone information may be added.
TemporalCoverageEnd Optionally, the end point of a time period (this data element is most likely applicable to resource collections).
DigitizationDate Date the first digital version was created, where different from temporal coverage. This is often *not* the file creation or modification date, which often only captures the last format change or processing. Use the international date format (see above).
MetadataLastModified Point in time recording when the last change to metadata occurred. (The last modification of the media content is not recorded here; it is generally assumed to be present in the file information itself.)
Fields derived from dc:coverage:
CountryCodeList The geographic location of the specific entity documented by the media item, expressed through a constrained vocabulary of countries using 2-letter ISO country codes (e. g. "it, si"). Accepted exceptions to be used instead of ISO codes are: "Global", "Marine", "Europe", “N-America”, “C-America”, “S-America”, "Africa", “Asia”, “Oceania”; this list may be extended as necessary. This should always be present if CommonGeoAreaName is present.
CommonGeoAreaName The single highest geographic area, in the language of your metadata; e. g. country name, name of national park. This may be "Global", “Europe”, “Germany”, “Oceans”, etc. Do not use country codes, but spelled out names in the metadata language! This should always be present if SpecificGeoAreaName is present.
SpecificGeoAreaName Actual geolocation of observation (city, location details down to the village, forest, etc.). Do not repeat the common geographic area.
Fields derived from dc:subject:
TaxonCategory Constrained vocabulary of highest taxonomic groups like vertebrates, fungi, etc. The list of controlled vocabulary terms is available further below.
LowestCommonTaxon The lowest taxon integrating all taxa covered by the resource or resource collection (e. g., the name of the family from which some genera are keyed out). Example: “Aves” for a bird key or a bird image collection. Do not add a rank (“Class Aves”). The purpose of this field is similar to the purpose of TaxonCategory. However, this may be free-form text and may be more specific than the controlled vocabulary in TaxonCategory allows for. If the resource contains a single taxon, this should be placed only in TaxonList, leaving LowestCommonTaxon empty.
TaxonList Semicolon-separated list of all taxa covered. If possible, add this information even if the title or caption already contains the taxon names. If this is not possible, at least for identification tools and media collections the number of taxa should be given in TaxonCount (see below). Please do not use abbreviated Genus names here! Do not repeat the LowestCommonTaxon here.
TaxonCount Please give an exact or estimated number of specific taxa in any case, even where a complete list of taxa is not available or practical. Please try to give this information even where not required. The count should best contain only the taxa covered fully or primarily by the resource. For a taxon page and most images this will be “1”, i. e. other taxa mentioned or in the background should not be counted. However, sometimes a resource may illustrate an ecological or behavioral entity with multiple species, e. g. a host-pathogen interaction. This should be a single integer number. Leave the field empty if you cannot estimate the information (do not enter 0).

If desired, rank-specific taxon counts may be given in addition to TaxonCount. SupragenericTaxonCount includes all higher taxonomic ranks above (but not including) the genus, infrageneric taxa are the ranks between genus and species (not including either), and infraspecific ranks include subspecies, variety, forma.

The content of these fields should be single integer numbers. The sum of these detailed fields should be equal to TaxonCount. (Usage note: the fields have been requested by partners but have not actually been used; in a first phase, they will not be supported by the repository or the search engine.)
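The consistency rule for the rank-specific counts can be checked mechanically. The sketch below is only an illustration: of the field names used, only TaxonCount and SupragenericTaxonCount appear in this agreement; GenusCount, InfragenericTaxonCount, SpeciesCount and InfraspecificTaxonCount are hypothetical names chosen for the example, and the function `rank_counts_consistent` is likewise invented.

```python
# Only SupragenericTaxonCount is named in the agreement; the other names are hypothetical.
RANK_FIELDS = [
    "SupragenericTaxonCount",
    "GenusCount",
    "InfragenericTaxonCount",
    "SpeciesCount",
    "InfraspecificTaxonCount",
]

def rank_counts_consistent(record, rank_fields=RANK_FIELDS):
    """Check that the supplied rank-specific counts add up to TaxonCount.

    Empty fields are treated as "not supplied". If TaxonCount or all of the
    rank-specific counts are missing, there is nothing to compare.
    """
    total = record.get("TaxonCount", "").strip()
    supplied = [int(record[f]) for f in rank_fields if record.get(f, "").strip()]
    if not total or not supplied:
        return True
    return sum(supplied) == int(total)

# Hypothetical record: 2 + 5 + 40 + 3 = 50 taxa in total.
record = {
    "TaxonCount": "50",
    "SupragenericTaxonCount": "2",
    "GenusCount": "5",
    "SpeciesCount": "40",
    "InfraspecificTaxonCount": "3",
}
print(rank_counts_consistent(record))
```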

ScientificNameSynonyms Applicable only if the resource relates to a single taxon: a semicolon-separated list of scientific names that are synonyms may be provided here.
VernacularNames Applicable only if the resource relates to a single taxon: semicolon-separated list of vernacular (= common) names; each name with a language code in parentheses behind it. Example: "abete bianco (it); Tanne (de); White Fir (en)"
DocumentsSpecimen Free-form text specifying that a resource documents some aspect (habitat, morphology, behaviour, organism interaction) of specimens preserved in museums or culture collections (“strains”), or observations recorded in observation databases. Examples: for NHM “BM 23974324” for a barcoded or “BM Smith 32” for a non-barcoded specimen; for UNITS: “TSB 28637”; for PMSL: “PMSL-Lepidoptera-2534781”. (Usage note: this field has not actually been used; it is planned to support it nevertheless.)
GeneralKeywords Keywords or "tags". Character or part keywords like "leaf", "flower color" are especially desirable. Where possible, use the more specific categories provided below and exclude scientific names, common names, and geographic locations from keywords.
ContentType Intended to distinguish between different types of content representation like “line drawing, grayscale drawing, color drawing, grayscale photo, color photo”. Occasionally a distinction between normal, light-microscopic, TEM or SEM photos may also be desirable. Both field name and scope should be further discussed!

(Usage note: this field has not actually been used; it is planned to support it nevertheless.)

Fields derived from dc:source:
PublishedSource If image, key, etc. was taken from (i.e. digitized) or was also published in a digital or printed publication. Do not put generally "related" publications in here. This field normally contains a free-form text description; it may be a URL (“digitally-published://ISBN=961-90008-7-0”) if this resource is also described separately in the data exchange.
DerivedFrom Derivation of one resource from another is of special interest for identification tools (e. g. a key from an unpublished data set, as in FRIDA, or a PDA key from a PC or web key) or web services (e. g. a name synonymization service being derived from a specific data set). It may very rarely also be known where one image or sound recording is derived from another (but compare the separate mechanism to be used for quality/resolution levels). – In such cases please enter either the URL used elsewhere, or – if not available – a simple name of the “parent” tool in this field.

(Usage note: this field has not actually been used; it is planned to support it nevertheless.)

ContentModification If media content has been modified or edited significantly in ways that are not immediately obvious to or expected by consumers, this must be documented and explained. Examples for images are: removing a distracting twig from a picture, moving an object to a different surrounding, or changing the color in parts of the image. Modifications that are standard practice and expected by or obvious to users need not be documented. Examples for images are: changing resolution, cropping, minor sharpening or overall color correction, and clearly perceptible modifications (adding arrows or labels, combining multiple pictures into a table). If it is only known that significant modifications were made, but no details are known, a general statement like “Media may have been manipulated to improve appearance” may be appropriate.
Field derived from dc:format:
Format Necessary only, if offline and digital, or if the URL does not include an extension (e.g., "" may be an image, a sound, or a web page). Three types of values are acceptable: (a) any MIME type; (b) common file extensions like txt, doc, odf, jpg, png, pdf; (c) the following special values: Data-CD, Audio-CD, Video-CD, Data-DVD, Audio-DVD, Video-DVD.
Specific fields for identification keys:
Keys_Interactivity Fixed values “Static”, “Hyperlinked”, and “Dynamic”. Both single-access (= dicho-/polytomous) and multi-access keys may or may not be presented dynamically (also called “interactive” or “adaptive”). “Hyperlinked” is intended for HTML or PDF hypertext documents limited to simple links (connecting parts of the key or leading to external resources).
Keys_Structure Fixed values: “Dichotomous” (single-access key with branching limited to two leads), “Polytomous” (single-access key with at least occasionally more than 2 leads), “Multi-access” (the sequence of characters or leads can be freely chosen by the user), “Multi-entry” (in a first step, a free choice of multiple characters is available, followed by a single-access or browsing structure), “Browsing” (descriptions or images arranged in a long sequence like field guides). If an identification tool contains several different keys, this may be a semicolon-separated list.
Keys_HostApplication The software necessary to use the identification tool, whether this software is distributed with the tool or not. Examples: “Web browser”, “PDF Reader”, “Lucid Player”, “Linnaeus II player”. Use the value “Custom” if the identification tool is uniquely coupled with custom-programmed software that has no independent name. See also the section “Value lists” below for further examples.
Keys_TargetSystem Interactive keys may be available for different target systems/hosts (example values: “Web”, “Java”, “.NET”, “Mac”, “Windows”, “Win98”, “Linux”, “PDA”, “Smartphone”; further values may be added as necessary, especially where specific OS versions are targeted). Multiple values may be used, separated by a semicolon, if a single identification tool is provided for multiple hardware or operating systems. Do not add “Mac”, “PC”, “Linux”, etc. where generic host applications (like Web browser or PDF reader) or virtual machines (like Java or .NET) are targeted. See also the section “Value lists” below for further examples.
Keys_ExchangeFormats Potential data interchange formats for identification keys. Example values: “DELTA”, “SDD”, “NEXUS”, “Sybase DB”, “Filemaker DB”, “Comma separated values”. Multiple values may be used, separated by a semicolon. See also the section “Value lists” below for further examples.
For individual resources:
BestQualityURL Best available quality (which may be non-digital or offline). Use this field if only one quality level is available (as it is typical for taxon pages or keys!) This may also be a published CD, etc. (For Provider and Collection resource types, use the “Homepage” field instead!)
GoodQualityURL Quality intended for resources displayed as primary information; e.g. an image between 600 and 1200 px

MediumQualityURL Intermediate quality, e.g. shortened or using a higher compression causing moderate artifacts.
LowerQualityURL A smaller/shorter quality that still contains significant information like a 3-5 second birdsong, an image around 150-300 px, etc. Typical for information displayed in a series, e.g. a list of images of states of a character.
NormalPreviewURL Preview, not normally sufficient as an information source in itself, e.g. a short 3 second clip of a bird song, or an image thumbnail perhaps 80-160 px large.
TinyPreviewURL A yet smaller preview, e.g. for images a thumbnail less than 80 px large.
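A consumer of these metadata will often want "the URL closest to a given quality level", since most providers fill only some of the six URL fields. The following sketch shows one possible fallback strategy; the strategy itself (try the preferred level, then smaller versions, then larger ones), the function name `pick_url`, and the example.org URLs are assumptions for illustration and are not prescribed by this agreement.

```python
# The six URL fields of the agreement, ordered from best quality to tiniest preview.
QUALITY_FIELDS = [
    "BestQualityURL",
    "GoodQualityURL",
    "MediumQualityURL",
    "LowerQualityURL",
    "NormalPreviewURL",
    "TinyPreviewURL",
]

def pick_url(record, preferred):
    """Return (field, url) for the preferred level or the nearest filled alternative.

    Assumed strategy: the preferred level first, then successively smaller
    versions, then successively larger ones. Returns None if no URL is present.
    """
    start = QUALITY_FIELDS.index(preferred)
    order = QUALITY_FIELDS[start:] + QUALITY_FIELDS[:start][::-1]
    for field in order:
        url = record.get(field, "").strip()
        if url:
            return field, url
    return None

# Hypothetical record with only two quality levels filled in.
record = {
    "BestQualityURL": "http://example.org/img/1234.png",
    "LowerQualityURL": "http://example.org/img/1234_small.jpg",
}
print(pick_url(record, "NormalPreviewURL"))
print(pick_url(record, "BestQualityURL"))
```

With this record, a request for a preview falls back to the LowerQualityURL entry, because no preview exists and the next larger version is tried before the full-quality one.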

General notes

  1. Do not enter “empty”, “no”, or “-“ in fields to indicate that they contain no information or are inapplicable.
  2. Items that are on sale should use both an appropriate URL-equivalent (e. g., “digitally-published://”, see “Availability and URL notation”, below) and express the availability for purchase in the License field (e. g. “Individual licenses may be purchased through bookstores”). The IPR workpackage may add additional input for this.
  3. Formats may vary among different quality levels (png-image for high quality, jpg-image for lower). A deliberate decision was made that this cannot be expressed with the current format field. If indeed explicit format information is required, because the format cannot be deduced from the URLs (and thus the format field is redundant and empty), we need to find ways to solve it.
  4. Sometimes resources differentiate between “title” and “subtitle”. A subtitle may be a second sentence of the title, or it may rather be a longer description of the product. To simplify the structure of the web interface, the title field should be a complete, human-readable representation of the item. For example, in two resources with “Title = Flora of Erehwon; Subtitle = Gymnosperms” and “Title = Flora of Erehwon; Subtitle = Angiosperms”, the subtitle should be added to the title field to generate a usable title (“Flora of Erehwon. Gymnosperms”). However, occasionally the subtitle contains one or two sentences describing details of the resource and belongs in the Description field.
  5. The copyright and license statements should be complete statements. Copyright=”M. Name” is not a copyright statement; Copyright=”© M. Name 2007” is a copyright statement. It is tempting to omit the “copyright” part of the statement if the field is already labeled as such. However, the web interface cannot possibly know the correct way to express copyright in different languages and legal systems, so you need to provide complete statements.
  6. For most exchange formats (e. g., Database formats, tab-separated Unicode text), the sequence of metadata fields does not matter and you can rearrange the fields in a sequence different from the one in the table above.
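Several of the value conventions in the field list and the notes above (semicolon-separated lists, vernacular names with a language code in parentheses, the yyyy-mm-ddThh:mm date format, no placeholder values in empty fields) lend themselves to simple automated checks before data are submitted. The following Python sketch shows how such checks might look. All function names are invented for illustration, and the date check deliberately ignores the optional time zone information mentioned for TemporalCoverageStart.

```python
import re
from datetime import datetime

# Values that should not be used to mark an empty or inapplicable field (general note 1).
PLACEHOLDERS = {"empty", "no", "-", "–"}

def is_placeholder(value):
    """True if the value is one of the discouraged 'nothing here' markers."""
    return value.strip().lower() in PLACEHOLDERS

def split_list(value):
    """Split a semicolon-separated field (Creator, TaxonList, ...) into trimmed items."""
    return [item.strip() for item in value.split(";") if item.strip()]

# A name followed by a language code in parentheses, e.g. "Tanne (de)".
VERNACULAR = re.compile(r"^(?P<name>.+?)\s*\((?P<lang>[A-Za-z-]+)\)$")

def parse_vernacular_names(value):
    """Turn 'abete bianco (it); Tanne (de)' into [('abete bianco', 'it'), ('Tanne', 'de')]."""
    pairs = []
    for item in split_list(value):
        match = VERNACULAR.match(item)
        if match is None:
            raise ValueError(f"missing language code in {item!r}")
        pairs.append((match.group("name"), match.group("lang")))
    return pairs

def parse_date(value):
    """Accept 'yyyy-mm-dd' or 'yyyy-mm-ddThh:mm'; time zone suffixes are not handled."""
    for pattern in ("%Y-%m-%d", "%Y-%m-%dT%H:%M"):
        try:
            return datetime.strptime(value, pattern)
        except ValueError:
            continue
    raise ValueError(f"not in yyyy-mm-dd[Thh:mm] format: {value!r}")

names = parse_vernacular_names("abete bianco (it); Tanne (de); White Fir (en)")
print(names)
print(parse_date("2007-12-31T14:59"))
```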

Multilingual Metadata

Some providers have metadata such as titles, descriptions, and keywords in more than one language. It is highly desirable to provide metadata in your native language to the metadata index and not only translations.

The metadata exchange format is intended to allow more than one data row for each resource. Each row must be distinguished by the language in MetadataLanguage, and has information for title, caption etc. in the corresponding language. The rows are kept together by the resource identifier. For collections, this is the CollectionID (required!), whereas for identification tools, taxon pages and media resources this is the BestQualityURL.

An example is given in the following table, showing metadata for a single media resource in two languages. Please do not add new fields like “TitleEn”, “TitleFr” to the metadata files.

Type | Title | MetadataLanguage | BestQualityURL
StillImage | Oak infected with powdery mildew | en |
StillImage | Mit Mehltau infizierte Eiche | de |

The first resource URL in such cases will be the same and serves to keep the multilingual metadata together. Note that occasionally closely related resources may have different URLs in different languages (e. g. if an image contains language-specific text). In such cases the field “Language” (which is different from MetadataLanguage) should also be set to the language of the resource itself.

Note that whenever possible MetadataLanguage should only be a single value. Note that it is not necessary to provide all metadata fields in all languages. Simply create the record in a secondary language only with those fields that are available (but do use the BestQualityURL field in all records, so that the records can be reconnected when converting and integrating the data).

Note: Experience with the first metadata collection in the Key to Nature project showed that it has not become clear enough that a BestQualityURL should always be provided whenever possible, even if the metadata are not multilingual or if the resource is non-digital, unpublished, or both. To support the recombination of multilingual metadata, a unique URL is required.
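The recombination of multilingual rows described above can be sketched as a simple grouping step. This is only an illustration under stated assumptions: the function names are invented, and the URL is a hypothetical example.org address, since any shared BestQualityURL would serve the same purpose.

```python
from collections import defaultdict

def resource_key(row):
    """Collections are keyed by CollectionID, all other resource types by BestQualityURL."""
    if row["Type"] == "Collection":
        return ("Collection", row["CollectionID"])
    return ("Resource", row["BestQualityURL"])

def group_by_resource(rows):
    """Collect the language variants of each resource under one key."""
    grouped = defaultdict(dict)
    for row in rows:
        grouped[resource_key(row)][row["MetadataLanguage"]] = row
    return dict(grouped)

# Hypothetical URL; both rows describe the same image in different metadata languages.
URL = "http://example.org/images/oak-mildew.jpg"
rows = [
    {"Type": "StillImage", "Title": "Oak infected with powdery mildew",
     "MetadataLanguage": "en", "BestQualityURL": URL},
    {"Type": "StillImage", "Title": "Mit Mehltau infizierte Eiche",
     "MetadataLanguage": "de", "BestQualityURL": URL},
]
grouped = group_by_resource(rows)
titles = {lang: row["Title"] for lang, row in grouped[("Resource", URL)].items()}
print(titles)
```

The sketch also shows why a missing BestQualityURL is a problem: two rows with an empty URL would be merged into one resource even if they describe different items.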

Availability and URL notation (including “pseudo-URLs” for non-internet resources)

The most complex part of the data exchange is perhaps the proposed structure for transferring information about the location and availability of a resource (“URL” fields at end of previous table). This is a “denormalized” structure intended to avoid another relation for the quality levels of resources. Media resources like images or sounds in particular are typically available in different quality levels (e. g., high-resolution, web-optimized, or different thumbnail sizes), under different URLs. If your resources exist only in a single quality level (typical for identification keys or taxon pages), concentrate only on the fields BestQuality… and ignore all fields starting with Good…, Medium…, Lower…, NormalPreview…, or TinyPreview….

For online digital resources we use URLs (Uniform Resource Locators, i. e. the addresses one would type into the address field of a web browser). However, a resource may be available only after login or it may be offline (digital or non-digital, published or not). Because this information interacts in complex ways with URLs, and because this interacts with quality levels (one quality may be available only in non-digital form, another after login, the remaining publicly online), we also use the URLs to express this information. The following fixed vocabulary is used:

Prefix in URL | Availability Value | Description | Examples of prefixed URLs
(no prefix) | online (free) | No prefix indicates a publicly visible URL. |
login- | online (login) | Prefixing the URL with “login-” indicates that it is accessible only after a login or sign-in (e.g., username, password, token). | login-
file:// | unpublished (digital) | Local digital files are indicated by this value or prefix. Use “file://” as a value if you want to keep the file location private; use it as a prefix if you want to share local file locations (e. g. to simplify communication). file:// may also refer to unpublished local CDs. | file://; file://Oudemans/Storage/Fungi/234.jpg
digitally-published:// | published (digital) | Use as value or prefix for published digital content that is not available online (especially CDs). Please use text or codes after the “//” to create a unique identifier to the published resource. An ISBN number is a good choice if available. See also the field “PublishedSource”, which may repeat some of this information in greater detail. | digitally-published://ISBN=234-23-23444-X
published:// | published | Use as value or prefix for published, non-digital content. See also the field “PublishedSource”. | published://ISBN=234-23-23444-X
unpublished:// | unpublished | Use as value or prefix for unpublished, non-digital content like printed photos, slides, analogue tapes, etc. Use information like author, year, title, project, etc. to create a unique descriptive string after the “//”. | unpublished://CollectionMattold

The flat-table exchange standard uses the URL prefixes, since different quality levels may have different availability. For the resource index compiled from this information, as well as for the Wiki-based metadata repository, the values in “Availability Value” will be used for each quality level.
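The translation from a prefixed URL to its availability value could be sketched as below. This is a minimal Python illustration under the assumption that the prefixes and values are exactly those of the fixed vocabulary above; the function name `availability` and the handling of empty fields are hypothetical choices, not part of the agreement.

```python
# Prefixes of the fixed vocabulary, mapped to their availability values.
PREFIX_TO_AVAILABILITY = [
    ("login-", "online (login)"),
    ("file://", "unpublished (digital)"),
    ("digitally-published://", "published (digital)"),
    ("published://", "published"),
    ("unpublished://", "unpublished"),
]

def availability(url):
    """Return (availability value, URL remainder after the prefix) for one URL field."""
    url = url.strip()
    if not url:
        # Assumption: an empty field means this quality level does not exist.
        return None, ""
    for prefix, value in PREFIX_TO_AVAILABILITY:
        if url.startswith(prefix):
            return value, url[len(prefix):]
    # No prefix indicates a publicly visible URL.
    return "online (free)", url
```

For example, `availability("unpublished://CollectionMattold")` yields the value “unpublished” together with the descriptive string “CollectionMattold”, while a bare “file://” yields “unpublished (digital)” with an empty remainder, i. e. a private file location.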

The following example indicates that for the collection JKI_Fus the best quality is digital, but only locally available, and that three publicly available quality levels are present. For JKI_Oo, the best quality is available after login, and two quality levels are available publicly.

Type CollectionID BestQualityURLPrefix GoodQualityURLPrefix LowerQualityURLPrefix TinyPreviewURLPrefix
Collection   JKI_Fus
Collection JKI_Oo
Type CollectionID BestQualityURL GoodQualityURL LowerQualityURL TinyPreviewURL
StillImage   JKI_Fus file://f3223.png /Fungi/Fus/web/f3223.jpg /Fungi/Fus/low/f3223.jpg /Fungi/Fus/pre2/f3223.jpg
StillImage JKI_Oo login-http://Storage/Fungi/Oomyc/ori/o8888.png /Fungi/Oomyc/web/o8888.jpg /Fungi/Oomyc/pre2/o8888.jpg

Value lists (= content standards)

For several data fields we will be using “value” (or “content”) standards.

Type: The resource-type-values are partly taken from DublinCore, but extended to suit the specific needs of the project:

  • UnicodeQC: (Fixed value. This is required a single time in each data exchange.)
  • Provider: (This is required a single time to represent your organization/institution.)
  • Collection: (This is required at least once to represent a resource collection. All individual resources should be grouped into collections. The collections may be used to express common information; many metadata are inherited from the collection to individual resources.)
  • StillImage: A normal, non-moving image (preferred formats are png and jpg)
  • Sound: A sound recording
  • MovingImage: A video
  • Map: A map. The format may be an image, PDF, a web page, a dynamic service, etc.
  • TaxonPage: A resource (html, pdf, etc.) describing a species or other taxon as free-form text, optionally enriched with images or other rich media content (subtype of
  • GlossaryPage: A resource (html, pdf, etc.) giving a definition or explanation of descriptive terms (parts, stages, char., states, etc.) as free-form text, optionally enriched with images or other rich media content (subtype of
  • IdentificationTool: A document (e. g., in pdf format), data set, or software with embedded data, that may be used to identify organisms or other entities (parts, diseases, etc.). The various types of keys are specified in separate metadata fields.
  • DescriptiveDataset: A data set that is not already listed as part of an identification tool. Of special interest are those data sets that are used to generate specific identification tools, but are not made available themselves.
  • VernacularNameDataset: A list of vernacular (= common) names, possibly in multiple languages. NOTE: please treat the entire list as a single resource annotating number of taxa and taxon coverage; the detailed data will be exchanged at a later time.
  • SynonymDataset: A list of scientific (= “Latin”) names with synonyms. NOTE: please treat the entire list as a single resource annotating number of taxa and taxon coverage; the detailed data will be exchanged at a later time.
  • Dataset: Other data sets that do not fit the specific data set categories: DescriptiveDataset, VernacularNameDataset, SynonymDataset
  • Software: Any software (a computer program in source or compiled form) that may be of relevance to the project and is distributed independently of an IdentificationTool.
  • Service: A general service. Use “IdentificationTool” for specific identification services.
  • InteractiveResource: An online-resource that can only be experienced by interacting with it. Typical examples are query/search pages for synonym or common name lookup. Use “IdentificationTool” for interactive identification services.

Note: Taxon or species pages (resource type: “TaxonPage”) usually contain text plus media like images (“StillImage”). If possible, it would be highly desirable if you could provide both the taxon page resource and URLs for the individual images.

Language/MetadataLanguage: Please use the ISO 639 2-letter language codes. Examples are “en, de, it, fr, et, es, sl, ro, bg”. We don’t expect to cover languages available only as three-letter codes.

It is desirable to distinguish language-neutral resources (like images without text on them) from those usable only under a specific language (images with text overlay, sound recording with a speaker announcing the organism, etc.). ISO 639-2 defines three code elements for special situations: mul (multiple languages) should be applied when many languages are used and it is not practical to specify all the appropriate language codes; und (undetermined) is provided for those situations in which a language or languages must be indicated but the language cannot be identified; and zxx (no linguistic content) may be applied in a situation in which a language identifier is required by system definition, but the item being described does not actually contain linguistic content.

We therefore request to code resources with unknown language as “und” (undetermined) and those that are language-neutral as “zxx” (“ind” is code for Indonesian!). Where necessary, metadata referring exclusively to scientific organism names will be coded (somewhat incorrectly) as “la” (Latin).

CountryList: Please use the ISO 3166 2-letter country codes. Examples are “UK, DE, IT, FR, EE, ES, SI (sic!), RO, BG”. See table “Officially assigned code elements” in

Note: It is recommended to use upper-case codes for countries and lower-case codes for languages, but for the purpose of our data exchange this does not matter.
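Because the case of the codes does not matter for the exchange, an integrating application would probably normalize them on import. The following Python sketch illustrates this under stated assumptions: it checks only the shape of a code (two letters, or one of the special codes mul, und, zxx), not whether the code is actually assigned in ISO 639 or ISO 3166. The function names are hypothetical.

```python
import re

# Special ISO 639-2 codes discussed above.
SPECIAL_LANGUAGE_CODES = {"mul", "und", "zxx"}

def normalize_language(code):
    """Return a lower-case language code, or raise ValueError if the shape is wrong."""
    code = code.strip().lower()
    if code in SPECIAL_LANGUAGE_CODES or re.fullmatch(r"[a-z]{2}", code):
        return code
    raise ValueError(f"not a 2-letter or special language code: {code!r}")

def normalize_country(code):
    """Return an upper-case 2-letter country code, or raise ValueError."""
    code = code.strip().upper()
    if re.fullmatch(r"[A-Z]{2}", code):
        return code
    raise ValueError(f"not a 2-letter country code: {code!r}")
```

With this sketch, `normalize_language(" DE ")` gives “de” and `normalize_country("si")` gives “SI”, while other three-letter language codes are rejected, in line with the expectation stated above.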

TaxonCategory: This is a fixed list of high-level taxonomic or ecological groups like plants, mosses, fungi, algae, etc., intended for general orientation of the user rather than for strict taxonomic classification purposes. This list cannot and does not follow a phylogenetic taxonomy. It includes ecological groups like lichens and paraphyletic groups (those excluding some groups). For all categories, the higher categories should be used only if further information is missing or if a mixture of lower categories is indeed present. The terms in brackets are for clarification, and should not be used in data exchange. Thanks to Gideon Gijswijt for helping with this.

"TaxonCategory" should group rare taxa in a way that makes it easy to find commonly searched groups while still allowing the rare ones to be found. It should be an orientation feature rather than an exact taxonomy. TaxonCategory is supplemented by LowestCommonTaxon, allowing expressions like "TaxonCategory = Insecta" – "LowestCommonTaxon = Lepidoptera".

Note: The following list is probably still too long, but let us start with something too long rather than too short… Only bold-printed terms should be used as values; please observe hyphens where present.

  • Viruses
  • Prokaryota
    • Archaea (= archaebacteria)
    • Bacteria
  • Eukaryota (all following groups)
  • Fungi-sensu-lato (paraphyletic grouping of taxa considered “fungi” in the classical sense, including Fungi, Oomycota/downy mildews, Myxomycetes)
  • Fungi (here without Oomycota/downy mildews, see Chromista for those)
    • Glomeromycota
    • Zygomycota
    • Chytridiomycota
    • Ascomycota
    • Basidiomycota
    • Deuteromycota (= fungi imperfecti, mitosporic fungi)
    • Lichenes (lichens, mostly ascomycetes in symbiosis with various algal groups)
  • Algae (non-taxonomic term, encompassing parts of Prokaryota, Protozoa, Chromista, and Plantae)
  • Chromista (= part of “algae”)
    • Cryptophyta
    • Haptophyta
    • Labyrinthulomycota
    • Ochrophyta
    • Hyphochytriomycota
    • Oomycota (= traditionally part of fungi)
    • Sagenista
  • Plantae (use this only if a mixture of algae, mosses, etc.!)
    • Plant-fossils (summarize all kinds of extinct groups here)
    • Small-algal-groups (Cyanidiophyta, Glaucophyta, Prasinophyta)
    • Rhodophyta (red algae)
    • Chlorophyta (green algae)
    • Charophyta (green algae, here as paraphyletic group excluding embryophytes)
    • Bacillariophyta (diatoms)
    • Moss-like-plants (= non-vascular land plants traditionally considered “bryophytes”)
      • Anthocerotophyta (hornworts)
      • Bryophyta (leafy mosses in the sense of Takakiopsida, Sphagnopsida, Andreaeopsida, Andreaeobryopsida, Polytrichopsida, Bryopsida)
      • Marchantiophyta (= Hepaticophyta, “Hepatophyta”, liverworts)
    • Tracheophyta (= vascular plants, “plants” in the common sense)
      • Lycopodiophyta (clubmosses)
      • Pteridophyta (ferns)
      • Equisetophyta (horse tails)
      • Seed-plants (spermatophytes)
        • Basal-seed-plants (Cycadophyta/cycads, Ginkgophyta/ginkgo, Gnetophyta)
        • Pinophyta (conifers)
        • Magnoliophyta (flowering plants)
  • Protozoa
    • Acrasiomycota, Apicomplexa, Cercozoa, Choanozoa, Ciliophora, Dictyosteliomycota, Dinophyta, Euglenozoa, Myxomycota, Myzozoa, Plasmodiophoromycota, Sarcomastigophora, Acantharia, Filosia, Granuloreticulosea, Haplosporea, Heliozoa, Labyrinthulea, Lobosa, Sporozoa
  • Animalia
    • Invertebrates (any kind of invertebrates as a non-taxonomic grouping; use only if no better category is available)
    • Porifera (sponges: Calcarea, Demospongiae, Hexactinellida)
    • Ctenophora (comb jellies: Nuda, Tentaculata)
    • Cnidaria (stinging animals: Anthozoa (sea anemones and corals), Cubozoa (box jellies), Hydrozoa, Scyphozoa (jellyfish), Staurozoa, Myxozoa)
    • (Superphylum Deuterostomia:)
    • Echinodermata (Ophiuroidea (brittle stars), Crinoidea (feather stars), Holothuroidea (sea cucumbers), Asteroidea (sea stars), Echinoidea (sea urchins), Somasteroidea)
    • Hemichordata (acorn worms: Enteropneusta, Pterobranchia)
    • Invertebrate-chordates (Appendicularia, Ascidiacea, Thaliacea)
    • Vertebrata (vertebrate chordates)
      • Fish (paraphyletic grouping)
        • Other-Fish (= rare fish groups: Cephalaspidomorphi, Cephalochordata/lancetfish, Myxini/hagfish)
        • (Superclass Osteichthyes (bony fish):)
        • Actinopterygii (ray finned fish)
        • Sarcopterygii (lobed finned fish)
        • (Superclass Chondrichthyes (cartilaginous fish):)
        • Elasmobranchii (rays and sharks)
        • Holocephali (chimaeras)
      • Amphibia
      • Aves
      • Dinosaurs (special category, only fossil group listed separately!)
      • Reptilia
      • Mammalia
    • (Superphylum Ecdysozoa)
    • Nematoda (= Nemata, nematodes)
    • Tardigrada (water bears)
    • Cephalorhyncha (other worms: Kinorhyncha, Loricifera, Nematomorpha, Priapulida)
    • Arthropoda (arthropods)
      • Arthropod-fossils
      • Crustacea
        • Branchiopoda
        • Remipedia
        • Cephalocarida
        • Maxillopoda
        • Ostracoda (seed shrimps)
        • Malacostraca (crabs, lobsters, shrimps)
      • Chelicerata
        • Arachnida (spiders)
        • Merostomata (horseshoe crabs and eurypterids)
        • Pycnogonida (sea spiders, Pantopoda )
      • Myriapoda
        • Chilopoda (centipedes)
        • Diplopoda (millipedes)
        • Pauropoda (rare millipede-like group)
        • Symphyla (garden centipedes)
      • Hexapoda (= insects in a traditional sense)
        • Entognatha (Collembola, Diplura, Protura)
        • Insecta (insects in the strict sense)
    • (Superphylum Platyzoa)
    • Platyhelminthes (flatworms: Cestoda, Trematoda, Turbellaria)
    • Rotifera (rotifers)
    • Rare-Platyzoa (rare marine or parasitic worm groups: Gastrotricha, Acanthocephala, Gnathostomulida, Micrognathozoa, Cycliophora, Mesozoa)
    • (Superphylum Lophotrochozoa)
    • Mollusca (mollusks)
      • Bivalvia (scallops, clams, oysters, mussels, etc.)
      • Cephalopoda (octopuses, squid, cuttlefish, etc.)
      • Gastropoda (snails and slugs)
      • Polyplacophora (Chitons)
      • Scaphopoda (tusk shell)
      • Other-Mollusks (small groups like Monoplacophora, Aplacophora)
    • Annelida (segmented worms, here only Clitellata, Pogonophora, Polychaeta)
    • Rare-Lophotrochozoa (rare groups, usually marine: Echiura (spoon worms, a small group of marine animals), Sipuncula, Nemertea (ribbon worms), Phoronida, Ectoprocta (= Bryozoa, moss animals), Entoprocta, Brachiopoda (lamp shells))

(End of TaxonCategory list)


  • Line-drawing (a drawing essentially black and white; the image format may be between, gray-scale, or color image)
  • Gray-scale-drawing (artistic drawing with shaded areas; for simple shading using lines or dots use Line-drawing)
  • Color-drawing (e. g. water-color, oil painting)
  • Gray-scale-photo (photographic image in “black-and-white”)
  • Color-photo

MIME_Format: In addition to common file-extensions recognized by browsers, any MIME code may be used here. (Note: This is necessary only if the format of a digital resource cannot be inferred from its URL. If your URL ends in common file extensions like “.jpg/.jpeg”, “.png”, “.gif”, “.tif/.tiff”, “.mpg/.mpeg”, “.htm/.html”, “.pdf/.doc/.txt/.odf”, etc. this field may be left empty.)
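How a format might be inferred from the file extension at the end of a URL can be sketched with Python's standard `mimetypes` module. This is only an illustration: the function name `inferred_mime_type` is hypothetical, and the set of recognized extensions depends on the extension table of the Python installation rather than on this agreement.

```python
import mimetypes
import posixpath

def inferred_mime_type(url):
    """Guess the MIME type from the file extension at the end of a URL, else None."""
    # Drop a query string, then take the extension of the last path segment.
    extension = posixpath.splitext(url.split("?", 1)[0])[1].lower()
    if not extension:
        return None
    # Look the extension up via a dummy file name, so that URL prefixes such as
    # "login-" or "file://" cannot confuse the lookup.
    return mimetypes.guess_type("resource" + extension)[0]
```

If the function returns None for a digital resource, the MIME_Format field would need to be filled in explicitly by the provider.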

Offline digital and non-digital availability encoded in URLs: the fixed values are given above under “Availability and URL notation”.

Value standards for specific identification tool metadata (prefix “Keys_”):

(Many values are already provided in the metadata field comments above; some lists are elaborated here. This version should be considered normative if conflicts between the two definitions should be present.)


  • Static = no dynamic change of structure and presentation of key occurs in response to user interaction (as in conventional, printed or printable single-access or multi-access keys).
  • Hyperlinked = principally static and printable key that is enhanced with simple hyperlink jumps or popup-tooltip text (as commonly found in html or pdf documents).
  • Dynamic = dynamically changing structure and presentation in response to user interaction and progress of identification (as in typical software identification tools).


  • Dichotomous = single-access key with branching limited to two leads
  • Polytomous = single-access key with at least occasionally more than two leads
  • Multi-access = the sequence of characters or leads can be freely chosen by the user
  • Multi-entry = a free choice of multiple characters is available in a first step, followed by a browsing or single-access key structure
  • Browsing = descriptions or images arranged in a long sequence like field guides
  • (Extensions to this list are possible but require a discussion and following consensus.)


  • Web browser = compatible with typical web browsers like Internet Explorer vers. 5-6, Firefox 1-3, Netscape 7 or higher.
  • IE7, Firefox 2, Firefox 3 = specific web browsers (if compatibility is limited to this software, e. g. for Firefox plugins).
  • PDF Reader = Any software capable of displaying PDF files
  • Acrobat Reader 7 = a specific software if compatibility is limited to this.
  • Lucid 2 Player = CBIT Lucid software in version 2
  • Lucid 3 Player = CBIT Lucid software in version 3
  • Linnaeus II player = ETI’s Linnaeus software (the player).
  • Intkey = CSIRO’s Intkey software
  • Custom = use this value if the identification tool is uniquely coupled with custom-programmed software that has no independent name.
  • (These values are freely extensible.)


  • JAVA, .NET = the JAVA or .NET virtual machine
  • Mac, Mac OS X = generic or specific Mac versions
  • Windows, Win95, Win98, WinME, Win2k, WinXP, Win2k3, Vista = generic or specific Windows versions
  • Linux, Linux 2.2 = generic or specific Linux versions
  • PDA = generic for Personal Digital Assistants / Smartphones.
  • Symbian OS, Palm OS, Windows Mobile, RIM BlackBerry = specific PDA or Smartphone operating systems.
  • These values are freely extensible, but the values above should be used where possible.


  • SDD1.0, SDD1.1
  • SybaseDB
  • FilemakerDB
  • AccessDB
  • Excel97-2003, Excel2007
  • Tab-separated-values
  • Comma-separated-values
  • These values are freely extensible, but the values above should be used where possible.

Some definitions

“Identification tool”: In the Key to Nature consortium the term is defined as the combination of an identification data set and a software application to use it. An identification tool allows the identification of an organism within a given group, for a defined area, in a defined season, etc. Examples are "Key to the butterflies of the XY National Park", “Winter key to European Shrubs”, or “British poisonous plants”. Key to Nature limits its scope to digital identification tools, which may exist in a wide range of types, from complex interactive tools to simple HTML pages or PDF documents.

Normally, an identification tool will include the necessary software to use it (in contrast to a descriptive dataset). However, the software application does not need to be provided directly if it is typically found on the targeted hardware system (e. g., JAVA or .NET virtual machines, web browsers, PDF readers) or if a link is provided where the software can be downloaded free of charge.

Identification tools may or may not be specific to different hardware. Where tools are provided in different versions for different hardware (e. g. PDA or mobile phones), these may be considered different resources. Capturing the relation between such variants of tools is desirable; it may be achieved by considering one tool to be derived from the other (field “Keys_DerivedFrom”). – If the same text and image content is provided for different hardware (perhaps with automatic image sizing to available screen space, adaptations for grayscale screen, modes for people with disabilities, etc.), this may be considered a single tool. The field “Keys_TargetSystem” may then contain a semicolon-separated list of systems like “Mac; WinXP; Win2k3; Vista; PDA”.

(Note on other uses of the term: Software applications developed to identify organisms using externally stored data (Intkey, Lucid, Palmkey, etc.) are often also called “identification tools” as a generic category, rather than as the specific product.)

“Identification data set”: In most identification tools the software application is general and can be combined with different data sets. These data sets contain the knowledge required to identify biological entities (organisms, diseases, symbioses, etc.) within a specific scope (taxonomic, geographical, ecological, seasonal, etc.).

Different types of data set formats can be distinguished: the original knowledge base used by the editing application (“builder”), a transformed format, “compiled” or “optimized” for the identification tool (“player”), and a document-like presentation format directly suitable for display in web browsers or other reader software. For example, DELTA, SDD, Lucid LIF or Xper2 are knowledge-base formats, the Intkey or Lucid player format is a compiled format for specific software, and FRIDA keys and SLIKS/SAIKS keys are presentation documents. Depending on the software, the production of an identification tool may involve different data formats or all steps may be present, key software may run directly on a knowledge base (Xper2, Navikey), or all functions may be rolled into a single format (as in SLIKS/SAIKS).

Where a knowledge base is not or only incompletely published as part of an identification tool, it is desirable to report the knowledge base as a resource of type “DescriptiveDataset”.

Static identification key: These identification tools provide only minimal interactive functionality. An example is a single pdf or html document, with or without hyperlinks.

Single-access key: An identification key that functions like a printed dichotomous or polytomous key. It may use a static display method or interactive user interface techniques. The sequence of questions (couplets, leads) is fixed by the creator of the key.

Multi-access identification key: Here the sequence of characters (“questions”, “couplets”) can repeatedly be freely selected by the user. The process of answering one or several character questions is followed by a report of the remaining taxa and can be repeated until the identification result is obtained.

Multi-entry identification key: Here a freely selectable choice of (usually several) characters (“questions”, “couplets”) is available in a first step (typically using an interactive form providing a subset of all characters). The process cannot, however, be repeated, and the identification continues using a (dynamically generated) single-access key.


Please add any information you consider desirable! You may have further information on resources that you consider valuable and which here may have been either simply forgotten or considered unlikely to exist. Please do provide us with such data; we may well be able to process and integrate it. Please simply use your local field name (or an appropriate translation to English) and prefix it with an “x_“. Some examples for cases that were considered but not included in the field names of the general list:

  • x_GUID: Globally unique identifiers other than the URL used to provide permanent identifiers for resources. Examples:, Note: Until 2008-09, no provider had such a practice.
  • x_Rating: Ratings of technical quality, content quality, or suitability for a given purpose (e. g., identification, teaching, glossary)?
  • x_HostScientificName, x_PathogenScientificName, etc.

Data packaging (file or upload formats)

The data may be transferred in various ways. All data exchange during the first two months of the project had to be limited to the use of documents, transferred manually through email or web-uploads. Although this method allows no automatic updating in the future and relies on repeated data integration, it is within the technological capabilities of the project members.

For the future, we are considering, on the one side, direct support for the Open Archives Initiative Protocol for Metadata Harvesting for technologically advanced providers, through the use of Fedora Commons. This would support automatic updating of the shared metadata about identification tools and media resources. However, the protocol requires custom software to be installed at the providers, and an xml-schema would have to be created from the elements defined in this exchange agreement.

On the other side, we would like to improve the interaction with “normal” partners by using the KeyToNature MediaWiki for uploading and syntax checking of data. The preferred “simple” method will then be to upload the metadata into the Key to Nature wiki, using a wiki-specific markup of the data. The markup is based on the template syntax; each record starts with double braces (“{{“) followed by the identifier of the template, a number of “|”-separated parameters, and ends with double braces again (“}}”). The precise format will be documented on the wiki. We are just starting to test this method; see “Planned Wiki based metadata exchange” further below.
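The general shape of such a template record might be produced as in the following Python sketch. Since the precise format is still to be documented on the wiki, everything specific here is an assumption: the template name “Metadata” is hypothetical, the parameters are written in the named form `name=value` that MediaWiki templates allow, and values containing template delimiters are simply rejected rather than escaped.

```python
def to_wiki_template(template_name, fields):
    """Render one metadata record as a MediaWiki template call (illustrative only)."""
    parts = [template_name]
    for name, value in fields.items():
        if not value:
            continue  # omit empty fields
        if "|" in value or "{{" in value or "}}" in value:
            # Assumption: no escaping rule has been agreed yet, so refuse such values.
            raise ValueError(f"value of {name} contains template delimiters")
        parts.append(f"{name}={value}")
    # Double braces open and close the record; "|" separates the parameters.
    return "{{" + "|".join(parts) + "}}"

record = to_wiki_template(
    "Metadata",  # hypothetical template identifier
    {"Type": "StillImage", "CollectionID": "JKI_Fus", "Title": ""},
)
```

The resulting string starts with double braces and the template identifier, followed by the “|”-separated parameters, and ends with double braces, which is the structure described above.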

Among the manually exchangeable file formats, the preferred transfer formats are:

  • Tabulator-delimited Unicode text files: Text should not be quoted, i. e. not surrounded by single (‘) or double (“) quotes, a tabulator (= character 09) should be present between the fields, and the rows should be separated by a new-line character (consistently one of: hexadecimal 0A, 0D, or the 0D+0A combination). New-line characters present in the text should be escaped by a defined mechanism, e.g. as “\n”. The first row of the file should contain the field names. Both UTF-8 and UTF-16 formats are acceptable, both with and without Byte-Order-Marks (BOM). Date values must be provided as “yyyy-mm-dd”, date/time values as “”, floating-point values using the “.” as decimal separator. Advantages: Highly compatible across different operating systems. Disadvantages: less control over type conversion of non-string fields (e. g. dates); testing is necessary to guarantee that no non-Unicode applications are used in the workflow. In the case of tab-delimited files, please make sure that your software does not add any quotes (“) around text. Please check whether your data contain new-line characters (carriage-return, line-feed), which are used in this format as the record delimiter. Please add the field names in the first row only; do not repeat them in front of each value.
  • Microsoft Access Database (Access 97 to 2007). Advantages: Unicode enabled; no problems with large field sizes, relatively widely distributed. Disadvantages: Requires Windows operating system.
  • Text documents containing tables in HTML or RTF format. This method is safer than the Excel transfer (see below), but more difficult to process.
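For the tab-delimited format, the unquoted output with escaped new-line characters could be produced roughly as follows. This is a minimal Python sketch, not a prescribed tool: the function names are hypothetical, and escaping of backslashes and tabulators inside field values is an additional assumption made here so that the escaping stays reversible (the agreement itself only names “\n” as an example).

```python
def escape_field(value):
    """Escape characters that would break the tab-delimited record structure."""
    return (
        value.replace("\\", "\\\\")   # assumption: escape backslash first
             .replace("\t", "\\t")    # assumption: tabs inside text are escaped too
             .replace("\r\n", "\\n")  # new-line variants become a literal \n
             .replace("\r", "\\n")
             .replace("\n", "\\n")
    )

def to_tab_delimited(records, field_names):
    """Return the file content: field names in the first row only, no quoting."""
    lines = ["\t".join(field_names)]
    for record in records:
        lines.append("\t".join(escape_field(str(record.get(name, "")))
                               for name in field_names))
    return "\n".join(lines) + "\n"

content = to_tab_delimited(
    [{"Type": "StillImage", "Title": "line one\nline two"}],
    ["Type", "Title"],
)
```

When saving, the content would be written with an explicit Unicode encoding and a consistent row separator, e. g. `open(path, "w", encoding="utf-8", newline="\n")`.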

Less desirable formats are:

  • Comma-delimited quoted Unicode text files with field names in the first row (“Unicode-CSV”). The comma delimitation either requires an escape method for comma in the text or quotes around text plus an escape method for quotes (which may occur in the text). These quoting and escaping methods vary (quotes may be escaped with backslash or by doubling the quote, text may only be quoted if it contains a quote, which in turn creates problems when the entire text is already quoted in the source data, etc.) and may require custom programming to deal with.
  • Microsoft Excel spreadsheets (97 to 2007). Advantages: Unicode enabled. Disadvantages: Requires Apple or Windows Operating Systems, may create difficult-to-detect problems with fields larger than 255 characters (Excel internally supports large text, but it may be difficult to convert such fields to a different format; they may become truncated or replaced with “########”).
  • Various xml formats. We will be able to process them, but note that different database or spreadsheet applications create xml data that conform to widely different and incompatible xml schemata. It may be desirable to agree on a common schema, or to provide a standard transformation for some common formats; this needs further investigation and agreement with the capabilities of providers. (Note on Oracle xml-exports: as of 2008-09, Oracle continues to export invalid xml. To fix this, open the file with a non-xml editor as UTF-8 text file and change "UTF8" to "UTF-8". The result can be processed and imported, e. g., into Microsoft Access.)

These formats are acceptable if a provider has problems delivering data in one of the preferred formats. On a case-by-case basis we will also be considering any other formats that you may want to propose.

Quality control for Unicode character set

Unicode characters may become corrupted if some application in the data processing sequence does not handle them properly. To help with the quality control of data transfer, each data transfer file should include a record that consists of the fixed value “UnicodeQC” for the Type field and the literal text: “«Unicode Test: ¿ŠšǍǎ – are S and A caron preserved?»” in the Title field (see example below; all other metadata fields may be empty). Please copy the test string exactly, including the guillemets but not the English quotes, using the clipboard or some other means. To control for your internal processing, please try to create this record as early as possible (at least if you suspect that you may have accented characters, e. g., in person names).

Type Title
UnicodeQC «Unicode Test: ¿ŠšǍǎ – are S and A caron preserved?»
Collection …   (etc.; i. e. all other records like Collection,
StillImage …   StillImage following the UnicodeQC record)
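On the receiving side, the UnicodeQC record could be checked along the following lines. This Python sketch assumes that the records are already parsed into dictionaries with the fields Type and Title; the function name is hypothetical, and comparing after Unicode normalization (NFC) is an assumption made here so that a merely decomposed but otherwise intact string still passes.

```python
import unicodedata

# The literal test string from this agreement, without the English quotes.
UNICODE_QC_TITLE = "«Unicode Test: ¿ŠšǍǎ – are S and A caron preserved?»"

def unicode_qc_passed(records):
    """Return True if a UnicodeQC record is present and its Title is uncorrupted."""
    expected = unicodedata.normalize("NFC", UNICODE_QC_TITLE)
    for record in records:
        if record.get("Type") == "UnicodeQC":
            title = unicodedata.normalize("NFC", record.get("Title", "").strip())
            return title == expected
    return False  # no UnicodeQC record in this transfer
```

A transfer in which the caron characters arrive as question marks, or in which the UnicodeQC record is missing altogether, would fail this check.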

Problems experienced in file-based data exchange

The mechanism described in this document was used for the initial document exchanges required to be finished by month 3 and 4 of the Key to Nature project and aims to support partners with little IT experience and support. We found, however, that the aggregation is laborious and error-prone, and that quality control is difficult. The quality control for Unicode conversion errors (see above) spotted many character set encoding errors; the varying presence of Byte Order Marks led to a need to experiment to find out which Unicode encoding a file might have; the tab-delimited UTF files often had quotes escaped (although this should not be necessary); and new-line characters present in the data often caused disruption by being not escaped or escaped in unexpected ways. All these problems are well known and a reason for the widespread use of xml data formats. The lack of compatible xml export routines in standard software, however, prevented us from using xml as our basis. Unfortunately, xml remains a format for experts.
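The experimentation with Byte Order Marks mentioned above can be partly automated. The following Python sketch inspects the first bytes of a file; the function name is hypothetical, and the approach has a known limit: UTF-16 without a BOM cannot be recognized reliably this way and may be misreported.

```python
import codecs

def sniff_unicode_encoding(data):
    """Return a Python codec name for the given bytes, or None if undecidable."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # decodes UTF-8 and strips the BOM
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"     # the codec reads the byte order from the BOM
    try:
        # No BOM: accept the data as UTF-8 only if it decodes without errors.
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return None
```

The returned name can be passed directly to `bytes.decode` or to the `encoding` argument of `open`.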

In addition, the three-level structure desired to adequately express origin and context of search results (single resource → data collection → data provider), implemented within the flat table data exchange structure through resource types and id-based relations, was often not expressed in the data being submitted. The explanations obviously were difficult to understand and several partners had difficulties in following the agreement.

In our analysis the principal problems causing this are:

  • Data exchange usually involves multiple steps of data conversion on the part of the data provider. The standard software tools typically being used have different implementations of character encoding and character escaping, only some of which may be – with difficulty – controllable by the data provider.
  • The providers have no quality control tool to adequately test the syntactical correctness of the exchange syntax, values used, or whether their data has the correct provider/collection/resource structure. Although such tools exist for xml, at the present time these are not widely available or even embedded in standard software.

In general, the process of manually following an export process, manually transferring data (email, ftp), and manually performing quality control and data conversion into a central repository is labor-intensive. Since fully automated methods are already beyond the project partners' abilities, and in any event would be a hurdle to interaction with new associated partners, we are searching for partial improvements of the process.

Planned Wiki-based metadata exchange

We are currently investigating and testing a method in which the uploading of data is a manual process distributed among all partners, but in which the upload system itself already reports certain quality problems. The central harvesting and integration of such data is then planned to be a fully automated process. Rather than writing custom software for a repository, we plan to use a MediaWiki installation with the following setup:

Each data provider is represented by a wiki page, containing any free-form text and images, plus a required infobox containing a short name, logo and web-address ("homepage"/URL). In addition it must be marked with the "Category:Resource Metadata Provider".

For each resource Collection, another page has to be created, containing any free-form text and images, plus a required infobox containing important metadata about the resource collection. On the collection page, a table of resources is presented, which is created by another template.

The key to this approach is the use of templates (infoboxes or table-row-templates). MediaWiki templates provide the following opportunities:

  • they provide a visually attractive reporting of the data, facilitating data proofreading;
  • they may provide error reporting facilities (required fields, incorrect values);
  • they are relatively trivial to parse for metadata harvesting.

We believe the Wiki method will prevent misunderstandings of the necessary relations from resources to collections to providers. Being browser-based, it usually prevents the corruption of Unicode characters, and it provides immediate feedback on certain quality problems.
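To indicate why template calls are relatively easy to harvest, the following Python sketch extracts the named parameters of one template call from the wikitext of a page and reports missing required fields. The template name `Collection Infobox`, the parameter names and the sample values are hypothetical. The sketch assumes that parameter values contain no nested templates and no pipe characters.

```python
import re


def parse_template(wikitext, name):
    """Return the named parameters of the first {{name|...}} call, or None."""
    pattern = r"\{\{\s*" + re.escape(name) + r"\s*\|(.*?)\}\}"
    match = re.search(pattern, wikitext, flags=re.DOTALL)
    if match is None:
        return None
    params = {}
    for part in match.group(1).split("|"):
        # Split at the first "=" only, so that values may contain "=".
        key, sep, value = part.partition("=")
        if sep:
            params[key.strip()] = value.strip()
    return params


def missing_fields(params, required):
    """Return the required parameter names that are absent or empty."""
    return [field for field in required if not params.get(field)]


# Hypothetical page text, for illustration only.
page = """Some free-form text about the collection.
{{Collection Infobox
| title = Example Collection
| provider = Example Provider
| homepage = http://example.org/
}}"""

params = parse_template(page, "Collection Infobox")
problems = missing_fields(params, ["title", "license"])
```

A production harvester would need a parser that handles nested templates and links; the simplification here is deliberate.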

The workplan for implementing this in WP4 (media resources) is:

  • testing the approach with the WP3 (identification tool) data, loaded centrally on the wiki (work under way);
  • writing a general user guide on how anybody may follow these examples;
  • testing the approach with the second round of updating the WP4 data;
  • writing harvesting software that converts the Wiki metadata repository into an integrated, searchable format (using Fedora Commons or relational databases).

Depending on the technical expertise of the data provider, the process of uploading metadata to the wiki repository may be manual or automatic. A manual conversion from the local data formats can be done relatively easily with spreadsheet and word-processing software, and we will try to write a guide to this. Since the target format is essentially plain text (which is then copied into the browser-based wiki editor using the clipboard), it can be expected that many partners will be able to follow the instructions. The process can, however, also be fully automated by writing data export routines and a simple web script which updates the wiki pages automatically.
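The export step of such an automated process could look like the following Python sketch, which converts tab-delimited rows (first row holding the column names) into one template call per resource. The template name `Resource Row` and the column names in the usage example are hypothetical. Replacing pipe characters with the character reference `&#124;` is one way to keep them from being read as parameter separators. The sketch only produces the wiki text; uploading it is left out.

```python
import csv
import io


def rows_to_wikitext(tsv_text, template="Resource Row"):
    """Convert tab-delimited rows into one template call per row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    lines = []
    for row in reader:
        params = ""
        for key, value in row.items():
            if not key or not value:
                continue  # skip empty cells and cells without a column name
            # A literal "|" would end the parameter, so encode it.
            params += "|" + key + "=" + value.replace("|", "&#124;")
        lines.append("{{" + template + params + "}}")
    return "\n".join(lines)


# Hypothetical input, for illustration only.
tsv = (
    "title\ttype\turi\n"
    "Example key\tIdentificationTool\thttp://example.org/key\n"
    "A|B\tStillImage\t\n"
)
wikitext = rows_to_wikitext(tsv)
```

The resulting text can be pasted into the wiki editor by hand, which keeps the manual and the automated route on the same target format.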

An essential point in this approach is that central error reporting can be automated. In a first step, the harvesting mechanism will simply ignore any data that are not fit for harvesting. The lack of the data in the search facility will already provide a primitive form of feed-back. In a second step, it can automatically add problem reports to the wiki pages it could not process.
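The two steps can be sketched in Python as follows. This is an illustration under assumptions, not the planned harvester: the required field names in `REQUIRED` are hypothetical, and each page is represented simply as a dictionary of already extracted template parameters.

```python
# Hypothetical required fields; the agreement defines the actual ones.
REQUIRED = ("title", "provider")


def harvest(pages):
    """Split pages into harvested records and per-page problem reports.

    `pages` maps a wiki page title to a dict of template parameters.
    Pages that are not fit for harvesting are ignored (step one), and a
    report is prepared that could be added to the page (step two).
    """
    harvested = []
    reports = {}
    for title, fields in pages.items():
        missing = [name for name in REQUIRED if not fields.get(name)]
        if missing:
            reports[title] = (
                "Not harvested; missing required field(s): " + ", ".join(missing)
            )
        else:
            harvested.append({"page": title, **fields})
    return harvested, reports
```

Keeping the reports separate from the harvested records lets the first step run on its own before the second step is implemented.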

Draft Agreements

The first data sharing was intended for project-internal purposes only. To allow safe publication of the shared data, we request that partners sign the following agreements:

1. Metadata sharing

By transferring metadata (such as description, authorship, keywords, taxon names) to the consortium, the partner agrees that this information may be centrally stored and published as part of the indexing and querying service.

To accommodate collaboration with international partners such as GBIF, Encyclopedia of Life, and Morphbank, the partners place the metadata under a Creative Commons 2.5 non-commercial, share-alike, attribution required license. Anyone using data under this license must cite the provider, and may share the data with others provided that these share the resources under the same conditions. The prohibition of commercial use only limits the data that are provided to the Key2Nature consortium. It does not prevent the provider itself from granting other commercial licenses separately from this agreement.

Date, Name in printed letters, and Signature

2. License to create small preview versions

In many cases the user interface for searching image, sound, or video resources benefits greatly from the ability to provide small previews (e. g. image “thumbnails”) of resources. These thumbnails are typically too small to fully use the information of the resource, but large enough to give an approximate impression of what the resource conveys.

Many partners will already inform the indexing system about available preview versions and these may then be used. However, the kind and size of preview versions may differ between providers and a size desirable in a given presentation may be missing. It will therefore simplify the design of the user interface, if small preview versions of resources may be centrally created and stored.

This agreement is limited to still images with a maximum extension of 160 pixels (e. g. 100 x 160 or 160 x 100), moving images with a maximum size of 320 x 240 pixels and a maximum length of 4 seconds, and sound resources with a maximum length of 4 seconds.

Signing this agreement is voluntary!

Date, Name in printed letters, and Signature

3. Backup or central storage service

If so desired, the central facility may acquire full sized versions of your resources, either as a silent backup or as the primary location of serving the resource to the public. This is not required for the operation of the indexing facility, but may be a desirable service for small institutions.

If your institution is interested, please express your interest below. We will then start drafting appropriate agreements together.

Date, Name in printed letters, and Signature