Strategies for collecting resource metadata

From KeyToNature

Two work packages in KeyToNature aim to improve the searchability and reuse of identification tools (WP3) and media data (WP4). This is achieved by collecting metadata about the existence and properties of resources from project partners and associates, integrating them, and providing a search mechanism for them. This article discusses present and future strategies for exchanging these data.

1 Introduction

By necessity, metadata exchange and integration must be based on some agreement that specifies semantic definitions for the exchanged concepts as well as syntactical and character-encoding definitions for a digital exchange format. For partners with well-equipped IT departments and expertise, typical solutions are based on XML/Unicode exchange formats, perhaps semantic web solutions (RDF/OWL-based) and, for example, the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) or other web-service-based methods.

Technologically advanced providers could directly support the Open Archives Initiative Protocol for Metadata Harvesting through the use of Fedora Commons. This would support automatic updating of the shared metadata about identification tools and media resources. However, the protocol requires custom software to be installed at the providers, and an XML schema would have to be created from the elements defined in this exchange agreement. See http://www.openarchives.org/ and http://www.oaforum.org/tutorial/english/intro.htm for further information.
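To give an impression of what harvesting through the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) involves on the harvester side, the following minimal Python sketch builds a ListRecords request URL. The repository address is a placeholder, and the function name and its defaults are assumptions made for illustration; only the protocol arguments (verb, metadataPrefix, from, resumptionToken) come from the protocol itself.

```python
from urllib.parse import urlencode

# Hypothetical repository endpoint; a real provider would publish its own base URL.
BASE_URL = "https://repository.example.org/oai"


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None,
                     resumption_token=None):
    """Build an OAI-PMH ListRecords request URL (sketch, no network access)."""
    if resumption_token:
        # A resumption token is an exclusive argument: it replaces all others.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            # Selective harvesting: only records changed since this date.
            params["from"] = from_date
    return base_url + "?" + urlencode(params)


print(list_records_url(BASE_URL, from_date="2009-01-01"))
```

The response to such a request is an XML document, which is why a provider would need both server software and an agreed XML schema.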

However, in the current project we found that most partners do not have these resources. With the Resource Metadata Exchange Agreement we have therefore specified simpler exchange formats, based on flat tables, which may be submitted as tab-delimited Unicode plain text, Microsoft Excel, or Microsoft Access files. We found, however, that the aggregation is laborious and error-prone, and that quality control is difficult. The quality control for Unicode conversion errors built into the exchange agreement spotted many character-set encoding errors, and the escaping of quotes and new-line characters present in the data is highly variable. The underlying problems are:

  • Data exchange usually involves multiple steps of data conversion on the part of the data provider. The standard software tools typically being used have different implementations of character encoding and character escaping, only some of which may be - and with difficulty - controllable by the data provider.
  • To adequately express the origin and context of search results, a three-level structure was desired (single resource → data collection → data provider). The flat-table data exchange structure uses resource types and id-based relations to express this. Explaining this structure turned out to be difficult, and several partners had difficulties in following the agreement.
  • The providers have no quality control tool to adequately test the syntactical correctness of the exchange syntax, the values used, or whether their data have the correct provider/collection/resource structure. Although such tools exist for XML, at present they are not widely available or embedded in standard software.
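The kind of structural test that providers lack can be sketched in a few lines of Python. The sketch assumes, purely for illustration, that each row of the flat table has the fields "ID", "Type" and "ParentID" and that the top level uses the type "Provider"; these names are hypothetical and are not taken from the exchange agreement.

```python
def check_structure(records):
    """Return problem messages for a flat list of record dicts.

    Assumed (hypothetical) layout: every record has "ID" and "Type";
    collections point to a provider and resources point to a collection
    through "ParentID".
    """
    problems = []
    by_id = {}
    for rec in records:
        if rec["ID"] in by_id:
            problems.append(f"duplicate ID {rec['ID']}")
        by_id[rec["ID"]] = rec
    for rec in records:
        rtype = rec["Type"]
        if rtype == "Provider":
            continue  # top level: no parent expected
        parent = by_id.get(rec.get("ParentID"))
        if parent is None:
            problems.append(f"{rec['ID']}: parent {rec.get('ParentID')!r} not found")
            continue
        # Collections belong to providers; everything else belongs to a collection.
        expected = "Provider" if rtype == "Collection" else "Collection"
        if parent["Type"] != expected:
            problems.append(
                f"{rec['ID']}: parent must be a {expected}, found {parent['Type']}")
    return problems


records = [
    {"ID": "p1", "Type": "Provider", "ParentID": ""},
    {"ID": "c1", "Type": "Collection", "ParentID": "p1"},
    {"ID": "r1", "Type": "StillImage", "ParentID": "c1"},
    {"ID": "r2", "Type": "StillImage", "ParentID": "p1"},  # skips the collection level
]
for message in check_structure(records):
    print(message)
```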

In general, the process of manually following an export process, manually transferring data (email, ftp), and manually performing quality control and data conversion into a central repository is labor-intensive. Since fully automated methods are already beyond the project partners' abilities, and in any event would be a hurdle to interaction with new associated partners, we are searching for partial improvements of the process.

2 Details about old, plain table-based data packaging methods (used in first Key to Nature survey)

Among the manually exchangeable file formats, the preferred transfer formats are:

  • Tabulator-delimited Unicode text files: Text should not be quoted, i. e. not surrounded by single (‘) or double (“) quotes; a tabulator (= character 09) should be present between the fields; and the rows should be separated by a new-line character (consistently one of: hexadecimal 0A, 0D, or the 0D+0A combination). New-line characters present in the text should be escaped by a defined mechanism, e.g. as “\n”. The first row of the file should contain the field names. Both UTF-8 and UTF-16 formats are acceptable, both with and without Byte-Order-Marks (BOM). Date values must be provided as “yyyy-mm-dd”, date/time values as “yyyy-mm-ddThh:mm.ss”, and floating-point values using “.” as the decimal separator. Advantages: Highly compatible across different operating systems. Disadvantages: less control over type conversion of non-string fields (e. g. dates); testing is necessary to guarantee that no non-Unicode applications are used in the workflow. In the case of tab-delimited files, please make sure that your software does not add any quotes (“) around text. Please check whether your data contain new-line characters (carriage-return, line-feed), which are used in this format as the record delimiter. Please add the field names in the first row only; do not repeat them in front of each value.
  • Microsoft Access Database (Access 97 to 2007). Advantages: Unicode enabled; no problems with large field sizes, relatively widely distributed. Disadvantages: Requires Windows operating system.
  • Text documents containing tables in HTML or RTF format. This method is safer than the Excel transfer (see below), but more difficult to process.
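For providers who can script their export, the tab-delimited rules above can be sketched in Python as follows. The function names are illustrative, and replacing tabulators inside text with a space is an assumption of this sketch, since the agreement text above only defines the handling of new-line characters.

```python
def escape_field(value):
    """Prepare one value for an unquoted, tab-delimited row."""
    text = "" if value is None else str(value)
    # New-line characters are the record delimiter, so every variant
    # (0D+0A, 0D, 0A) becomes the two-character escape sequence \n.
    text = text.replace("\r\n", "\\n").replace("\r", "\\n").replace("\n", "\\n")
    # Assumption of this sketch: a tabulator inside text becomes a space.
    return text.replace("\t", " ")


def to_tab_delimited(field_names, records):
    """Return the file content: field names in the first row only, no quotes."""
    lines = ["\t".join(field_names)]
    for rec in records:
        lines.append("\t".join(escape_field(rec.get(name)) for name in field_names))
    return "\n".join(lines) + "\n"


content = to_tab_delimited(
    ["Type", "Title"],
    [{"Type": "StillImage", "Title": "Line one\nline two"}],
)
print(content)
```

When writing the result to disk, opening the file with `open(path, "w", encoding="utf-8", newline="")` keeps Python from translating the row separator, so the same new-line character is used consistently.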

Less desirable formats are:

  • Comma-delimited quoted Unicode text files with field names in the first row (“Unicode-CSV”). The comma delimitation either requires an escape method for comma in the text or quotes around text plus an escape method for quotes (which may occur in the text). These quoting and escaping methods vary (quotes may be escaped with backslash or by doubling the quote, text may only be quoted if it contains a quote, which in turn creates problems when the entire text is already quoted in the source data, etc.) and may require custom programming to deal with.
  • Microsoft Excel spreadsheets (97 to 2007). Advantages: Unicode enabled. Disadvantages: Requires Apple or Windows operating systems; may create difficult-to-detect problems with fields larger than 255 characters (Excel internally supports large text, but it may be difficult to convert such fields to a different format; they may become truncated or replaced with “########”).
  • Various XML formats. We will be able to process them, but note that different database or spreadsheet applications create XML data that conform to widely different and incompatible XML schemata. It may be desirable to agree on a common schema, or to provide a standard transformation for some common formats; this needs further investigation and agreement with the capabilities of providers. (Note on Oracle XML exports: as of 2008-09, Oracle continues to export invalid XML. To fix this, open the file with a non-XML editor as a UTF-8 text file and change "UTF8" to "UTF-8". The result can be processed and imported, e. g., into Microsoft Access.)

These formats are acceptable if a provider has problems delivering data in one of the preferred formats. The preferred format as of 2009 is, however, the wiki-based format described in Help:How to add resource metadata on the Wiki.
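The variation in quoting and escaping noted for Unicode-CSV above can be made concrete with Python's standard csv module. The sketch writes the same record with two common conventions (doubling the quote, or escaping it with a backslash) and shows that a reader configured for the wrong convention does not recover the original values. The sample record is invented for illustration.

```python
import csv
import io

row = ["StillImage", 'Leaf of "Quercus robur", upper side']


def write_csv(row, **dialect_options):
    """Serialise one row to CSV text using the given dialect options."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n", **dialect_options).writerow(row)
    return buffer.getvalue()


def read_csv(text, **dialect_options):
    """Parse the first row of CSV text using the given dialect options."""
    return next(csv.reader(io.StringIO(text), **dialect_options))


doubled = write_csv(row)  # default: embedded quotes are doubled
backslashed = write_csv(row, doublequote=False, escapechar="\\")
print(doubled, backslashed, sep="")

# Reading with the matching convention restores the record ...
print(read_csv(backslashed, doublequote=False, escapechar="\\") == row)
# ... while reading with the other convention does not.
print(read_csv(backslashed) == row)
```

Because the receiving side usually cannot tell which convention a provider's software used, each Unicode-CSV delivery may need individual inspection.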

2.1 Quality control for Unicode character set (Plain table method)

Unicode characters may become corrupted if some application in the data processing sequence does not handle them properly. To help with the quality control of data transfer, each data transfer file should include a record that consists of the fixed value “UnicodeQC” for the Type field and the literal text “«Unicode Test: ¿ŠšǍǎ – are S and A caron preserved?»” in the Title field (see example below; all other metadata fields may be empty). Please copy the test string exactly, including the guillemets but not the English quotes, using the clipboard or some other means. To control for your internal processing, please try to create this record as early as possible (at least if you suspect that you may have accented characters, e. g., in person names).

Type        Title
UnicodeQC   «Unicode Test: ¿ŠšǍǎ – are S and A caron preserved?»
Collection  …
StillImage  …

(etc.; i. e. all other records, such as Collection and StillImage, follow the UnicodeQC record)
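On the receiving side, the UnicodeQC record can be checked automatically. The following Python sketch assumes a tab-delimited file whose first row holds the field names "Type" and "Title"; the function name is illustrative. The simulated corruption (UTF-8 bytes read as Latin-1) is one typical failure and serves only as an example.

```python
QC_TYPE = "UnicodeQC"
QC_TITLE = "«Unicode Test: ¿ŠšǍǎ – are S and A caron preserved?»"


def unicode_qc_ok(text):
    """Return True if the tab-delimited text holds an intact UnicodeQC record."""
    lines = text.lstrip("\ufeff").splitlines()  # tolerate a leading BOM
    if not lines:
        return False
    header = lines[0].split("\t")
    try:
        type_col, title_col = header.index("Type"), header.index("Title")
    except ValueError:
        return False  # required columns are missing
    for line in lines[1:]:
        fields = line.split("\t")
        if len(fields) > max(type_col, title_col) and fields[type_col] == QC_TYPE:
            return fields[title_col] == QC_TITLE
    return False  # no UnicodeQC record present


good = "Type\tTitle\nUnicodeQC\t" + QC_TITLE + "\nStillImage\tOak leaf\n"
# Simulate a non-Unicode step: UTF-8 bytes misread as Latin-1.
corrupted = good.encode("utf-8").decode("latin-1")
print(unicode_qc_ok(good), unicode_qc_ok(corrupted))
```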


3 A Wiki alternative

We are currently investigating and testing a method in which the uploading of data is a manual process distributed among all partners, but in which the upload system already reports certain quality problems. The central harvesting and integration of such data is then planned to be a fully automated process. Rather than writing custom software for a repository, we plan to use a MediaWiki installation with the following setup:

Each data provider is represented by a wiki page, containing any free-form text and images, plus a required infobox containing a short name, logo and web-address ("homepage"/URL). In addition it must be marked with the "Category:Resource Metadata Provider".

For each resource Collection, another page has to be created, containing any free-form text and images, plus a required infobox containing important metadata about the resource collection. On the collection page, a table of resources is presented, which is created by another template.

The key to this approach is the use of templates (infoboxes or table-row-templates). MediaWiki templates provide the following opportunities:

  • they provide a visually attractive reporting of the data, facilitating data proofreading;
  • they may provide error reporting facilities (required fields, incorrect values);
  • they are relatively trivial to parse for metadata harvesting.

We believe the Wiki method will prevent misunderstandings of the necessary relations from resources to collections to providers. Being browser-based, it usually prevents corruption of Unicode characters, and it provides immediate feedback on certain quality problems.
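To illustrate why templates are comparatively easy to parse, the following Python sketch extracts named parameters from simple template calls in wiki text. The template name "Provider infobox" is hypothetical; the parameters mirror the short name, logo and homepage mentioned above. The sketch deliberately ignores nested templates and wiki links containing "|", which a real harvester would have to handle.

```python
import re

# Matches {{Name | ... }} for simple calls that contain no nested templates.
TEMPLATE_RE = re.compile(r"\{\{\s*([^|{}]+?)\s*\|(.*?)\}\}", re.DOTALL)


def parse_templates(wikitext):
    """Yield (template_name, params) for each simple template call."""
    for match in TEMPLATE_RE.finditer(wikitext):
        name, body = match.group(1), match.group(2)
        params = {}
        for part in body.split("|"):
            key, sep, value = part.partition("=")
            if sep:  # unnamed (positional) parameters are skipped in this sketch
                params[key.strip()] = value.strip()
        yield name, params


page = """Some free-form text about the provider.
{{Provider infobox
| short name = Example Herbarium
| logo = Example.png
| homepage = http://www.example.org/
}}
[[Category:Resource Metadata Provider]]"""

for name, params in parse_templates(page):
    print(name, params)
```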

The workplan for implementing this in WP4 (media resources) is:

  • testing the approach with the WP3 (identification tool) data, loaded centrally on the wiki (work under way);
  • writing a general user guide on how anybody may follow these examples;
  • testing this with the second round of updating the WP4 data;
  • writing harvesting software that converts the wiki metadata repository into an integrated, searchable format (using Fedora Commons or relational databases).

Depending on the technical expertise of the data provider, the process of uploading metadata to the wiki repository may be manual or automatic. A manual conversion from the local data formats can relatively easily be done with spreadsheet and word-processing software, and we will try to write a guide to this. Since the target format is essentially plain text (which is then copied into the browser-based wiki editor using the clipboard), it can be expected that many partners will be able to follow the instructions. The process can, however, also be fully automated by writing data export routines and a simple web script that updates the web pages automatically.
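The export half of such an automated route could look like the following Python sketch, which turns one local record into the plain text of a template call. The template name "Resource row" and the field names are hypothetical, and rejecting values that contain template syntax is a simplification chosen for this sketch; the upload step itself is not shown.

```python
def render_row(template_name, record):
    """Render one record as a wiki template call, one parameter per line."""
    parts = ["{{" + template_name]
    for key, value in record.items():
        text = str(value)
        # These sequences would be read as template syntax by the wiki,
        # so this sketch rejects them instead of guessing an escape.
        if "|" in text or "{{" in text or "}}" in text:
            raise ValueError(f"field {key!r} contains wiki template syntax: {text!r}")
        parts.append(f"| {key} = {text}")
    parts.append("}}")
    return "\n".join(parts)


print(render_row("Resource row", {"Type": "StillImage", "Title": "Oak leaf"}))
```

The resulting text can be pasted into the wiki editor by hand or submitted by a script, so manual and automated providers would produce the same page content.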

An essential point in this approach is that central error reporting can be automated. In a first step, the harvesting mechanism will simply ignore any data that are not fit for harvesting. The absence of these data from the search facility will already provide a primitive form of feedback. In a second step, the mechanism can automatically add problem reports to the wiki pages it could not process.
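The two steps can be sketched together in Python: records that are not fit for harvesting are left out, and a problem report is collected per wiki page for later posting. The required fields and the input layout (a mapping from page title to already parsed records) are assumptions made for this sketch.

```python
# Hypothetical minimum that a record needs in order to be harvested.
REQUIRED_FIELDS = ("Type", "Title")


def harvest(pages):
    """Split parsed pages into harvestable records and per-page problem reports.

    `pages` maps a wiki page title to the list of record dicts found on it.
    """
    accepted, reports = [], {}
    for title, records in pages.items():
        for index, record in enumerate(records, start=1):
            missing = [f for f in REQUIRED_FIELDS if not record.get(f, "").strip()]
            if missing:
                # Step 1: the record is ignored. Step 2: the problem is noted
                # so that it can be added to the wiki page later.
                reports.setdefault(title, []).append(
                    f"record {index}: missing {', '.join(missing)}")
            else:
                accepted.append(record)
    return accepted, reports


pages = {
    "Collection A": [
        {"Type": "StillImage", "Title": "Oak leaf"},
        {"Type": "StillImage", "Title": " "},
    ],
    "Collection B": [{"Title": "Key to oaks"}],
}
accepted, reports = harvest(pages)
print(len(accepted), reports)
```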
