Xesam Protocol for Metadata Harvesting
This page is a draft.
This draft is heavily inspired by Philip's proposal for metadata harvesting from Gnome's email client Evolution, mixed with some of the ideas found in OAI-PMH, the widespread standard for harvesting metadata online.
Concepts
Target: Some application or service exposing a Xesam-PMH DBus interface allowing clients to harvest metadata stored in, or by, the target
Crawler: A specific component inside the harvesting clients that is responsible for aggregating the harvested data
Payload: A nugget of information harvested from a target. A payload stores metadata about a single item, identified by the item's URI
API
org.freedesktop.xesam.pmh.Target
ListRecords (t mtime, as fields, as content_cats, as source_cats, u max_batch_size, o crawler)
- Prepare a crawler for receiving records. The return value of this method determines whether the crawler will receive an incremental update or a full dump of the whole target
mtime : Records should have timestamps strictly greater than this value
fields : A hint to the target about which fields the crawler is interested in. The target may, however, choose to emit more or less metadata than requested
content_cats : An inclusive list of content categories that harvested items should match
source_cats : An inclusive list of source categories that harvested items should match
max_batch_size : A hint to the target to not pass more payloads than this number in one call to UpdateRecords. It is not guaranteed to be respected
crawler : DBus object path to the crawler to send metadata to. The target must look up the sender's bus name from the method invocation
Returns : A boolean. If the returned value is True, the records pushed back via the crawler's UpdateRecords() method will be incremental updates reflecting a delta compared to mtime. If the returned value is False, mtime is too old and the target will not provide an incremental update but a full checkout of all data, meaning that the crawler will have to clear all data ever received from the target
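The selection rules for ListRecords can be sketched in plain Python, without any DBus binding. This is only an illustration under assumptions the draft does not state: the Record type and both helper functions are hypothetical, mtime is treated as an integer that grows over time, and an empty category list is taken to mean "no filtering".

```python
from typing import Iterator, List, NamedTuple, Sequence


class Record(NamedTuple):
    """Hypothetical in-memory record held by a target."""
    uri: str
    mtime: int          # assumed: an integer that grows over time (DBus type 't')
    content_cat: str
    source_cat: str


def select_records(records: Sequence[Record], mtime: int,
                   content_cats: Sequence[str],
                   source_cats: Sequence[str]) -> List[Record]:
    """Return records strictly newer than mtime that match the category lists.

    Assumption: an empty category list means "match everything".
    """
    def wanted(record: Record) -> bool:
        return (record.mtime > mtime
                and (not content_cats or record.content_cat in content_cats)
                and (not source_cats or record.source_cat in source_cats))

    # Oldest first, since payloads must be sent in timestamp order.
    return sorted((r for r in records if wanted(r)), key=lambda r: r.mtime)


def batches(records: Sequence[Record], max_batch_size: int) -> Iterator[List[Record]]:
    """Split records into chunks of at most max_batch_size (treated as at least 1)."""
    size = max(1, max_batch_size)
    for start in range(0, len(records), size):
        yield list(records[start:start + size])
```

Each batch would then be converted to payloads and passed to one UpdateRecords call. Since max_batch_size is only a hint, a real target is free to ignore it.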
org.freedesktop.xesam.pmh.Crawler
UpdateRecords (a(ssssta(ss)) payloads)
- Used by a target to send metadata to a crawler
payloads : A list of payloads. The scary signature is described in the section About Payloads below. The payloads must be ordered according to the timestamps in the payload headers, both within the array of payloads itself and across calls from the same caller.
If the crawler wishes to pause crawling it should simply block this call until it is ready. This means that targets should never ever do a synchronous DBus call on this method
Returns : Nothing
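The ordering requirement on UpdateRecords can be checked on the crawler side with a few lines of Python. This is a sketch only: the OrderChecker class is hypothetical, "ordered" is assumed to mean non-decreasing, and all timestamps are assumed to be ISO 8601 strings with an explicit UTC offset so that they compare safely.

```python
from datetime import datetime
from typing import Optional, Sequence, Tuple

# A payload as a plain tuple matching (ssssta(ss)); index 1 is the timestamp.
PayloadTuple = Tuple[str, str, str, str, int, Sequence[Tuple[str, str]]]


class OrderChecker:
    """Tracks the newest timestamp seen from one target across calls."""

    def __init__(self) -> None:
        self._last: Optional[datetime] = None

    def accept(self, payloads: Sequence[PayloadTuple]) -> bool:
        """Return True if the batch is ordered and not older than earlier batches.

        A rejected batch leaves the remembered timestamp unchanged.
        """
        last = self._last
        for payload in payloads:
            stamp = datetime.fromisoformat(payload[1])
            if last is not None and stamp < last:
                return False
            last = stamp
        self._last = last
        return True
```

A crawler that sees accept() return False could, for instance, call for invalidation with a suitable debug_message. The draft does not say how a crawler should react, so that choice is left open here.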
Invalidate (s debug_message)
Actively invalidate the crawler. Invoking this method forces the crawler to emit an Invalidated signal. If this method has been invoked by the target, it means that no more records will be sent to the crawler. A new crawler should be installed if resumption is needed
debug_message : Possibly empty string with a message from the target specifying the reason for invalidating the crawler
Returns : Nothing
CleanUp ()
The target has found that the mtime used when installing the crawler was too old and that the crawler will receive all data the target contains, right from the beginning of time. This means that the crawler should clear out any data it has already received from the target. This method will only ever be called as the first method on the crawler. To be absolutely clear, this method will not be invoked if updates have already been received with UpdateRecords.
Returns : Nothing
[Signal] Invalidated (s debug_message)
Emitted either as a response to an invocation of Invalidate or just because the crawler is not interested in more data from the target. The target should free any resources associated with the crawler when receiving this signal
debug_message : Possibly empty string with a message from the crawler specifying the reason for the invalidation
About Payloads
A payload has the DBus signature
(ssssta(ss))
Conceptually the payload is split into
Header: sssst
Body: a(ss)
Payload Header Explained
As ordered by the DBus signature sssst:
uri: The URI of the subject of the payload
timestamp: ISO 8601 timestamp the subject of the payload was updated
content_cat: The content category of the subject
source_cat: Source category of the subject
state: One of
- 1 : NORMAL. The subject can be expected to be "alive and kicking"
- 2 : DELETED. The subject has been purged from the target. In this case content_cat, source_cat, and the body (as described below) may be empty
Payload Body
The payload body is just an array of string pairs, tuples of (field_name, value).
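To make the layout concrete, here is a minimal Python sketch of a payload as a tuple whose members follow the signature (ssssta(ss)). The Payload type, the two helper functions, and the category and field names in the usage note are illustrative assumptions and not part of this draft.

```python
from datetime import datetime
from typing import Dict, List, NamedTuple, Tuple

# State values from the payload header description.
NORMAL = 1
DELETED = 2


class Payload(NamedTuple):
    """One payload; field order follows the DBus signature (ssssta(ss))."""
    uri: str                      # s
    timestamp: str                # s, ISO 8601
    content_cat: str              # s
    source_cat: str               # s
    state: int                    # t
    body: List[Tuple[str, str]]   # a(ss), pairs of (field_name, value)


def normal_payload(uri: str, updated: datetime, content_cat: str,
                   source_cat: str, fields: Dict[str, str]) -> Payload:
    """Build a NORMAL payload; the body is sorted by field name for stable output."""
    return Payload(uri, updated.isoformat(), content_cat, source_cat,
                   NORMAL, sorted(fields.items()))


def deleted_payload(uri: str, updated: datetime) -> Payload:
    """Build a DELETED payload with empty categories and an empty body."""
    return Payload(uri, updated.isoformat(), "", "", DELETED, [])
```

For example, normal_payload("imap://example/1", datetime(2008, 1, 2, tzinfo=timezone.utc), "email", "inbox", {"title": "Hello"}) yields a tuple that a DBus binding could marshal as one element of the payloads array (timezone comes from the datetime module; the category and field names are placeholders).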
Why a Push Based Solution
It might seem more natural for the API to be pull based, just like ordinary OAI-PMH, where ListRecords would be used to page through the updated items. The idea of installing a crawler into the target and then having the target push data into the crawler is to enable a more lightweight system for iterative updates.
Consider three different apps wanting to harvest metadata from an email client - preferably these apps want real time updates as emails trickle in. With the API of this spec the situation would be as follows:
- Apps A1, A2, and A3 all install crawlers in the email client M
- For the initial run M picks the crawler with the lowest mtime, say A1, and starts feeding it records with increasing mtimes
- When the mtime has come up to a point where A2 and A3 are also interested, they are fed records directly from the same pipeline as A1 - no need for additional cursors on the DB for M
- All crawlers are updated - idle time
- When M receives new emails they are immediately pushed into all active crawlers, no DB cursor needed on M's behalf
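The scenario above can be simulated in a few lines of Python. This is a toy model, not an implementation of the DBus interfaces: the Target class, its method names, and the (mtime, uri) record shape are all hypothetical, and delivery is recorded in a dictionary instead of being sent over the bus.

```python
from typing import Dict, Iterable, List, Tuple

Record = Tuple[int, str]  # (mtime, uri); mtime assumed to be an increasing integer


class Target:
    """Toy model of target M feeding several crawlers from one pass over its data."""

    def __init__(self, records: Iterable[Record]) -> None:
        self._records: List[Record] = sorted(records)
        self._crawlers: Dict[str, int] = {}        # crawler name -> requested mtime
        self.delivered: Dict[str, List[str]] = {}  # crawler name -> URIs received

    def install(self, name: str, mtime: int) -> None:
        """Register a crawler that wants records newer than mtime."""
        self._crawlers[name] = mtime
        self.delivered[name] = []

    def initial_run(self) -> None:
        """Walk the records once, starting from the lowest requested mtime."""
        if not self._crawlers:
            return
        start = min(self._crawlers.values())
        for mtime, uri in self._records:           # a single pass, i.e. one cursor
            if mtime <= start:
                continue
            for name, since in self._crawlers.items():
                if mtime > since:                  # this crawler has caught up
                    self.delivered[name].append(uri)

    def new_record(self, mtime: int, uri: str) -> None:
        """Push a freshly arrived record to every installed crawler."""
        self._records.append((mtime, uri))
        for name in self._crawlers:
            self.delivered[name].append(uri)
```

With records at mtimes 1, 2 and 3 and crawlers installed with mtimes 0, 2 and 1, the single pass gives the first crawler everything, the second only the last record, and the third the last two. A later new_record call reaches all three without another pass over the stored data.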
