WP3/A Wiki for a Community-Based Repository and Registry

From KeyToNature

(Note: the following is a first draft to propose an open, widely supported community resource as the basis for developing more advanced applications on top of it).

By G. Hagedorn, 2008-07-07

Registry and Repository for organism descriptions, media, identification tools and datasets

We envision a Wikipedia-like community project, where a super-institutional and super-national community has a sense of self-ownership and can make decisions without referring to programmers at the place hosting the software. The community should include the “tents” of individual researchers or software programmers (so many small biodiversity software or content projects exist!) as well as the “castles” of the big institutions or museums represented in TDWG or GBIF. The community would be sponsored by ALA, CBIT, EoL, Key to Nature and others.

Why would this not be Wikipedia itself?

  1. Fighting the anonymous user as well as the administrator with a different preferred idea in Wikipedia is hard work. Our own community could limit editing to recognized users.
  2. Wikipedia has strict rules of relevance that differ strongly from the rules we would like to see.
  3. We are in control of the software and can add extensions as needed, including new extensions to write or play keys directly inside the software (either by going into the big cohesive IdentifyLife system, or by using third-party tools).
  4. We can be more flexible with respect to licensing and uploading content.

To make this work and be accepted, it is important that it is not experienced as being owned by someone else. It requires a neutral name, strong endorsement from many sides of the community, not only from Australia, and preferably significant prestige (which EoL could supply ...).

Principles of the approach

  • Put the Wiki first. Turn the DELTA model (structured with free-form text embedded) around: Socially controlled free-form text first, with embedded structure.
  • NOT programmer-driven, NOT Big Design Up Front. Let content come in and let the programmers pick up what can be harvested from what is there. (Do help the programmer, see infoboxes below).
  • No non-extensible fields, categories, or layout. Structure it as part of the community, not as its masters.
  • Give people as much control of design and self-branding as possible (do provide examples, help, checklists). Let them design the pages with their own logos (in the article space). Give them incentives: something to see and to show. If we succeed in that, contributors may actually care to update their information (which they typically do not do after answering yet another online questionnaire).
  • Use precisely the community-tested MediaWiki software at the core. Build on the solutions that a similar community has already implemented (see separate list of arguments below). That does not mean that other software may not be necessary in support.
  • Keep it open for different forms of data and different software. The currently planned IdentifyLife system would naturally be integrated most smoothly (because money and love went into it), but any third-party

Infoboxes and markup

MediaWiki supports a form of free-form text “fill-in-forms”. These do have some problems, but also a lot of potential.

An example of a box providing metadata for a software tool is http://en.wikipedia.org/wiki/Mozilla_Firefox. Click on “edit” to see how the box looks during data entry. It can be freely edited and is preserved by bots and social control.

The same method is used for the taxoboxes in the Wikipedia. See, e. g. http://en.wikipedia.org/wiki/Phallus_indusiatus for a taxobox (and also a morphobox, although I am not sure the latter is a good approach). The links to the templates that contain the programming are http://en.wikipedia.org/wiki/Template:Taxobox http://en.wikipedia.org/wiki/Template:Mycomorphbox.

What is relevant here is:

  • an Infobox offers a fairly structured input form that must be learned, but can be learned by example (judging by its success in Wikipedia).
  • an Infobox is easily parsable by the next step in the pipeline (e. g. the IdentifyLife online key and character ontology builder harvesting the information for its purposes). The field names and syntax are controlled, or an error will appear after saving the page.
  • an infobox creates fancy layout (incentive) that is readable and will be read (good for data quality and not true for many html-forms-based data entry forms that don't give good reporting at the end).
  • an infobox can output further machine-readable content, especially categories and – as in the taxobox – even a microformat markup!
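The parsability claim can be illustrated with a minimal Python sketch of what a downstream harvester might do with the raw page text. The template name "Infobox dataset" and all field names are hypothetical, chosen only for illustration. The sketch assumes a flat box without nested templates or piped links; a real harvester would need a proper wikitext parser.

```python
import re


def parse_infobox(wikitext: str, template_name: str) -> dict:
    """Return the `| field = value` pairs of the first matching template call.

    Simplification: assumes the box contains no nested templates and no
    piped links, both of which a real harvester would have to handle.
    """
    pattern = re.compile(
        r"\{\{\s*" + re.escape(template_name) + r"\s*(.*?)\}\}",
        re.DOTALL,
    )
    match = pattern.search(wikitext)
    if match is None:
        return {}
    fields = {}
    # The text before the first "|" is empty here; skip it.
    for part in match.group(1).split("|")[1:]:
        if "=" in part:
            name, value = part.split("=", 1)
            fields[name.strip()] = value.strip()
    return fields


def missing_fields(fields: dict, required: list) -> list:
    """List required field names that are absent or left empty."""
    return [name for name in required if not fields.get(name)]


# Hypothetical template and field names, for illustration only.
SAMPLE_PAGE = """Free-form introduction.
{{Infobox dataset
| name    = Key to the species of Aus
| taxon   = Aus
| format  = SDD
| license =
}}
More free-form text."""

fields = parse_infobox(SAMPLE_PAGE, "Infobox dataset")
print(fields["taxon"])
print(missing_fields(fields, ["name", "taxon", "license"]))
```

The second function mirrors the idea of controlled field names: a bot could report boxes with missing entries back to the authors rather than rejecting the page.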

The larger vision

Ultimately I do envision something larger, built on the same principles: something that contains taxon descriptions as well as character and state descriptions.

However, even if I elaborate that in the following, let us keep this separate from the dataset and software registry, which we currently discuss. Please read the following only if you are interested in this vision, not for the current discussion itself.

Characters

Similar to taxa, characters could be developed on the Wiki. The Wiki supplies many functionalities helpful in a human-to-human communication that are very hard to achieve in a custom-designed character editor. Character definitions can be rich, formatted text, with hyperlinks, media, tables, references to literature and web resources. Versioning comes for free.

At the same time, using the same techniques of infoboxes or single-item markup, they can contain parsable relations to other characters, character sets, etc. The character/state descriptions are thus available for re-use as definitions out of their original context (i.e. the definition is a community action based on MediaWiki, but they can be used in any number of identification tools and players/builders for IdentifyLife).

Because of the service oriented architecture of the MediaWiki, it is possible to edit character relations both in IdentifyLife special forms and in the Wiki: OAI-PMH based harvesting of the wiki propagates changes in one direction and a MediaWiki bot could update changes to the wiki in the other direction.
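The harvesting direction could look roughly like the following Python sketch. It only builds an OAI-PMH ListRecords request and reads record identifiers and the resumption token from a response. The base URL, the record identifier and the sample response are invented for illustration; what the MediaWiki OAI extension actually returns is not specified here. No network access is performed.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc",
                     from_date=None, resumption_token=None):
    """Build a ListRecords request URL for an OAI-PMH repository."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is an exclusive argument.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            params["from"] = from_date  # harvest only changes since this date
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (record identifiers, resumption token or None)."""
    root = ET.fromstring(xml_text)
    identifiers = [el.text for el in
                   root.findall(".//oai:header/oai:identifier", OAI_NS)]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    token = token_el.text if token_el is not None and token_el.text else None
    return identifiers, token


# Invented sample response; identifier format is an assumption.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header>
      <identifier>oai:wiki.example.org:Aus_bus</identifier>
      <datestamp>2008-07-07T00:00:00Z</datestamp>
    </header></record>
    <resumptionToken>page2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

url = list_records_url("https://wiki.example.org/oai", from_date="2008-07-01")
ids, token = parse_list_records(SAMPLE_RESPONSE)
print(url)
print(ids, token)
```

A harvester would repeat the request with the returned token until no token is left, then remember the date of the run as the next "from" value.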

Using the Semantic MediaWiki (SMW) extension, the RDF/OWL even comes for free, without parsing, and in a way that is understandable by a biologist (this is personal, I have a hard time with Protégé...). SMW is very young and clearly imperfect, but it holds a lot of promise. It may not deliver or scale, but note that the plan does not depend on it.
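To give an impression of how little parsing is involved: SMW annotations are written inline as [[property::value]], optionally followed by a display text after a pipe. The following Python sketch pulls such annotations out of page text as simple (page, property, value) triples. The property names and the sample sentence are made up, and the sketch does not produce real RDF/OWL; it only shows that the statements are recoverable.

```python
import re

# [[property::value]] or [[property::value|display text]]
ANNOTATION = re.compile(r"\[\[([^\[\]:|]+)::([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")


def extract_triples(page_title: str, wikitext: str) -> list:
    """Return (subject, property, value) tuples for inline annotations."""
    return [
        (page_title, m.group(1).strip(), m.group(2).strip())
        for m in ANNOTATION.finditer(wikitext)
    ]


# Hypothetical property names and text, for illustration only.
SAMPLE_TEXT = (
    "Leaves are [[leaf arrangement::whorled]], with margins "
    "[[leaf margin::serrate|toothed]]. See also [[Aus bus|another taxon]]."
)

for triple in extract_triples("Macadamia ternifolia", SAMPLE_TEXT):
    print(triple)
```

Ordinary wiki links such as the last one in the sample are ignored because they contain no "::".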

Taxa

We can have the whole taxonomic tree in the Wiki as Wikipedia is doing anyways. I believe the natural place to submit or register a key to a taxonomic group is its taxon page, not some special key-registration page.

What we can do specifically here is to have “authored subpages”. By this I mean that within a taxon page, say “Aus bus J. Smith & Jones”, one could have subpages like “Aus bus J. Smith & Jones/Description of specimen B 32478 by B. Bernhard” or “Aus bus J. Smith & Jones/Ultrastructural images by E. Temophile”. One could have rules for such subpages specifying that only minor, obvious errors may be corrected, and that all other changes may only be discussed on the discussion page.

The main article would minimally act as link lists to sub-articles, but ultimately could be extended into a collaborative article.
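As a small illustration of the subpage idea, the following Python sketch splits page titles at the first "/" and collects, per taxon page, the list of authored subpages that the main article would link to. The function names are my own and the titles are the examples used above; nothing here is MediaWiki code.

```python
from typing import NamedTuple, Optional


class PageTitle(NamedTuple):
    taxon: str
    subpage: Optional[str]  # None for the main taxon article


def split_title(title: str) -> PageTitle:
    """Split "Taxon/Subpage" at the first slash."""
    taxon, sep, subpage = title.partition("/")
    return PageTitle(taxon.strip(), subpage.strip() if sep else None)


def link_lists(titles) -> dict:
    """Map each taxon page to the subpages found under it."""
    index = {}
    for title in titles:
        page = split_title(title)
        subpages = index.setdefault(page.taxon, [])
        if page.subpage is not None:
            subpages.append(page.subpage)
    return index


TITLES = [
    "Aus bus J. Smith & Jones",
    "Aus bus J. Smith & Jones/Description of specimen B 32478 by B. Bernhard",
    "Aus bus J. Smith & Jones/Ultrastructural images by E. Temophile",
]

for taxon, subpages in link_lists(TITLES).items():
    print(taxon, "->", subpages)
```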

The taxon articles I dream of will have a minimal structure provided by the wiki markup, but will otherwise be relatively free, with blocks of more structure embedded. The headings should not be pre-specified by a list of possible headings (which may be fine for plants, but fail to capture viruses...). Instead, the needs of the users could be analyzed and the “data block” ontology developed as needed. It is possible to have both blocks identified through their heading and blocks within a heading (using template markup again; see the description block, highlighted green for the purpose of an example, in the dummy example taken from the Flora of Australia, http://en.wikipedia.org/wiki/User:G.Hagedorn/Macadamia_ternifolia_F.Muell._sensu_Flora_of_Australia). The colors are only for demonstration. Please go to the edit view to see how structure can be achieved inside an unstructured shell. Thus, an unstructured Wiki could provide the blocks of information that EoL desires to aggregate.

The example above shows that it is possible to mark up individual characters in a description. Using the Semantic MediaWiki extension, this can be turned directly into RDF/OWL. However, to create highly marked-up descriptions as well as multi-access or single-access keys, special extensions that provide dedicated editing support are likely to be desirable. As shown in the LepTree editor, it is possible to provide natural language descriptions that, using Ajax and DHTML, can be edited on the spot.

For small identification keys (especially for single-access, dicho-/polytomous keys) it is possible to use wikipages rather than file attachments. For example, a special editor extension could work on the key embedded on the “Aus Smith & Jones” article and store its changes in “Aus Smith & Jones/Key to Species” page, providing all the versioning and change management (albeit probably only in raw text form initially).

Registry versus Repository

Keeping data in the communities that are prepared to support and maintain them, and mirroring this data in a synchronized series of caches, is a good approach, but is generally limited to large institutions. Based on a simple Wiki, the following options are possible:

  1. The Wiki only registers the dataset and the parameters that are required for access. Where security information is necessary, this would have to be exchanged through separate channels, however.
  2. A dataset is registered for harvesting, with a specification of whether or not the data may be archived permanently (should the originator cease to exist).
  3. Data are simply uploaded as documents where this is practical (any format can be possible, registration to the project is confirmed and vandalism limited socially).
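The three options could be told apart by a downstream harvester along the lines of the following Python sketch. All class and field names are hypothetical; in the proposal the information would live in an infobox on the wiki page, not in a separate data model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Mode(Enum):
    REGISTER_ONLY = 1  # option 1: wiki lists the dataset and access parameters
    HARVEST = 2        # option 2: dataset is registered for harvesting
    UPLOAD = 3         # option 3: data are uploaded as a document


@dataclass
class DatasetEntry:
    """Hypothetical registry record, for illustration only."""
    title: str
    mode: Mode
    access_url: Optional[str] = None
    may_archive_permanently: bool = False  # only meaningful for HARVEST
    uploaded_file: Optional[str] = None

    def problems(self) -> list:
        """Return human-readable problems; empty if the entry is usable."""
        found = []
        if self.mode in (Mode.REGISTER_ONLY, Mode.HARVEST) and not self.access_url:
            found.append("access_url is required for registered datasets")
        if self.mode is Mode.UPLOAD and not self.uploaded_file:
            found.append("uploaded_file is required for uploads")
        return found


entry = DatasetEntry("Key to the species of Aus", Mode.HARVEST)
print(entry.problems())
```

Note that security information is deliberately absent from the record, since it would have to be exchanged through separate channels.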

What we need

  • Trust and wide involvement.
  • A partner trusted in long-term commitment and prestige to run the data storage for a long time. This can only be a museum-like institution, not a project.
  • Prestige and advertisement. Engagement by the inner partners to bring in initial content, bring in content-prestige.
  • A good, perhaps journal-like title. Publication on the resource could be endorsed (similar to Wikipedia's excellence marks) to mark it as being reviewed or proofed. Perhaps such articles could be cited differently.
  • Money to tutor new users, advertise the facility. Money to write tutorials, like "How do I create a Flora/Fauna/Just-Key?" CBIT or XPer2 can advertise here, have portals to help new users.
  • Money to program and test the harvest pipeline: downstream users (CBIT/IdentifyLife, ALA, EoL) harvest the pages (MediaWiki does have an OAI-PMH extension by now) and parse the information into their databases.



Software choice

Currently only an argument for using MediaWiki is made, but further sections could be added.

Advantages of MediaWiki

  • We can draw on the knowledge of many more users (most people have edited in Wikipedia at least a few times; very few know TWiki syntax and handling, and the “web” metaphor in particular is rather confusing).
  • In contrast to TWiki, MediaWiki supports relatively user-friendly naming of articles and does not force users or pages into blank-free CamelCase titles. Titles can be international (not limited to English!).
  • MediaWiki separates the content from discussions on the content
  • MediaWiki has permanent and stable versions.
  • We can leverage on the user friendly and time-tested "how to" documentations of Wikipedia.
  • Wikis replace technological control with social control (although the IdentifyLife wiki has a lot of access control, which deeply frustrated me when preparing for the meeting: about half of the pages I tried to read were read-protected). In a self-governing community the rules of conduct are essential. By pointing to the Wikipedia rules as the default (but allowing different rules to be created whenever necessary) we can start with a workable set of rules.
  • The resource registration process can be made community friendly on a MediaWiki software. The Wikipedia community has developed solutions we can build on as well as improve. The folksonomy methods in wikipedia ("categories") combined with advanced templating ("Infoboxes") do offer a chance to combine freedom of use and extensibility with control through humans and bots, and harvesting the information into relational databases or triple stores.
  • MediaWiki has highly useful extensions for our purposes, some of which are not installed at the wikipedia (e.g. Dynamic Page List)
  • MediaWiki supports bots (for a long time) as well as a new RESTful API so that a service-oriented community architecture is possible
  • MediaWiki does support a new OAI-PMH extension (OAIRepository extension).
  • MediaWiki has a semantic web (RDF/OWL) extension (SMW), which is the first software where I am starting to believe that biologists may indeed be able to express knowledge that can be re-used as RDF/OWL and processed as semantic web content. Bob Morris has already tested it and is impressed as well. It is currently not as powerful as one may wish (e.g. relations cannot yet be defined as transitive), but then it is more a glimpse into the future than a requirement for the immediate purpose of the repository.
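As a hedged illustration of the bot and API point in the list above, the following Python sketch builds a request to the MediaWiki web API for the current source text of a page. The endpoint URL is a placeholder, and the sketch only constructs the URL; a real bot would also fetch the response, log in, and respect rate limits.

```python
from urllib.parse import urlencode, urlparse, parse_qs


def page_source_url(api_endpoint: str, title: str) -> str:
    """Build an api.php query URL asking for a page's latest revision text."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "titles": title,
        "format": "json",
    }
    return api_endpoint + "?" + urlencode(params)


# Placeholder endpoint; the real wiki address is an assumption.
request_url = page_source_url(
    "https://wiki.example.org/w/api.php", "Aus bus J. Smith & Jones"
)
print(request_url)
```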

Potential problems

  • Pages can be locked for editing (admin-only), but not locked by page-group for editing by user-group.
    However, this reflects the wiki principle, to replace software control with social control and can be very flexible. Certain kinds of edits may be welcome from everyone, other edits only as specified by the authoring team.
  • The real problem is pages that should not be readable. Although MediaWiki does have mechanisms to take down pages or versions completely, so that they are viewable only by specific admin groups, this does not extend to the normal user space.
    In KeyToNature we work around this by blocking all pages where the name ends in "_confidential" or "_Confidential". Such pages are visible only to users who have signed in and could be managed separately.
  • MediaWiki is best at large cohesive worlds, with gardens, but no hard separations. For certain groups that desire strong separation, separate Wikis may have to be created.
    Well-tested tools to migrate content between multiple wikis, including full version history, exist to allow some degree of flexibility should multiple wikis (analogous to TWiki webs) be necessary.
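The name-based workaround for confidential pages amounts to a very simple rule, sketched here in Python. This is not the actual KeyToNature configuration, which is not shown in this document; the function names are mine, and only the two suffixes are taken from the text.

```python
CONFIDENTIAL_SUFFIXES = ("_confidential", "_Confidential")


def is_confidential(title: str) -> bool:
    """True if the page name ends in one of the protected suffixes."""
    # MediaWiki treats spaces and underscores in titles as equivalent,
    # so normalize before comparing.
    return title.replace(" ", "_").endswith(CONFIDENTIAL_SUFFIXES)


def may_read(title: str, signed_in: bool) -> bool:
    """Anonymous visitors may read everything except confidential pages."""
    return signed_in or not is_confidential(title)


print(may_read("Budget_confidential", signed_in=False))
print(may_read("Budget_confidential", signed_in=True))
print(may_read("Aus bus J. Smith & Jones", signed_in=False))
```

The rule is coarse: it separates signed-in users from anonymous readers, not one user group from another.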

Experimentation

Charter Draft