Xesam Search Specification
This page is part of the XESAM specification version 0.9 (also known as RC1). The stable and blessed version will be named 1.0.
Design Goals
- Allow really simple use
- Allow complex use with live updating searches and complex queries
- Be suitable for both "high level" toolkit bindings and "direct use", e.g. from Python with glib mainloop integration
- Pass a minimal amount of data over the wire
- Generally be flexible to cater for different consumers
- Designed for desktop search, not arbitrary queries in a relational database or general RDF graph
- Allow all kinds of implementations from online services right down to find/grep/slocate-based ones
History
The xesam search API was originally proposed as two separate DBUS APIs: a Simple API and a fully featured Live API. Over time, and after lengthy discussions, it became clear that a truly simple API would be of extremely limited use. After a few attempts to save the Simple API we quickly converged towards the Live API. This is why the page name is XesamSearch90.
Terminology
query : a client side object in the form of an xml string as described in XesamQueryLanguage90
search : a server side object representing a compiled and accepted query
DBUS Names
The primary search engine of the current session should own the bus name org.freedesktop.xesam.searcher on the session bus. The object exposing the primary interface should have the path /org/freedesktop/xesam/searcher/main and implement the interface org.freedesktop.xesam.Search as described below.
org.freedesktop.xesam.Search
NewSession (out s session)
Request a new session on the search engine. A session represents the connection to the search engine. Sessions are used to spawn searches and to set and introspect session properties.
Return: session: An opaque handle to a Session object
SetProperty (in s session, in s prop, in v val, out v new_val)
Set a property on the session. It is not guaranteed that the requested value will actually be used; the return value is the property value that will be used. Search engines must, however, respect the default property values. For a list of properties and descriptions see below.
Calling this method after the first search has been created with NewSearch is illegal, and the server will raise an error if you do. I.e. once you create the first search the properties are set in stone for the parent session.
The search engine will also throw an error if the session handle has been closed or is invalid.
An error will also be thrown if the prop parameter is not a valid session property, if it is a property marked as read-only, or if the requested value is invalid.
session: A session handle obtained via NewSession
prop: The name of the property to set, see the list of session properties for valid property names
val: The value to set the property to
Return: new_val: The actual value the search engine will use. As noted above it is not guaranteed that the requested value will be respected.
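The error rules for SetProperty can be summarised in a small server-side sketch. This is not part of the specification: the `Session` class, the `XesamError` exception and the two-entry property table are hypothetical names chosen for illustration, and a real engine is free to adjust the requested value before returning it.

```python
class XesamError(Exception):
    """Hypothetical error type; the spec does not name its D-Bus errors."""


# Hypothetical subset of the session properties: name -> (default, read_only)
PROPERTIES = {
    "search.live": (False, False),
    "vendor.id": ("Unknown", True),
}


class Session:
    def __init__(self):
        self.closed = False
        self.has_searches = False  # a NewSearch implementation would set this
        self.values = {name: default for name, (default, _) in PROPERTIES.items()}


def set_property(sessions, handle, prop, val):
    """Apply the SetProperty error rules and return the value actually used."""
    session = sessions.get(handle)
    if session is None or session.closed:
        raise XesamError("invalid or closed session handle")
    if session.has_searches:
        raise XesamError("properties are frozen after the first NewSearch")
    if prop not in PROPERTIES:
        raise XesamError("unknown session property: " + prop)
    default, read_only = PROPERTIES[prop]
    if read_only:
        raise XesamError("read-only session property: " + prop)
    if type(val) is not type(default):
        raise XesamError("invalid value for " + prop)
    session.values[prop] = val
    return session.values[prop]
```

In this sketch the type check against the default value stands in for "the requested value is invalid"; an actual engine would validate each property according to its own documented type.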
GetProperty (in s session, in s prop, out v value)
Get the value of a session property. The server should throw an error if the session handle is closed or does not exist. An error should also be raised if prop is not a valid session property.
session: A session handle obtained via NewSession
prop: The name of the property to get, see the list of session properties for valid property names
Return: value: The value of the session property
CloseSession (in s session)
Close the session and all child searches spawned by NewSearch on the supplied session handle. The server should raise an error if the session handle is already closed or does not exist.
If a client exits without closing the session it is the responsibility of the server to clean up the orphaned session.
Implementation note: Proper clean up of orphaned sessions is easily achieved by storing the dbus name of the caller in NewSession and listening for NameOwnerChanged signals where the name of a known session user changes to the empty string.
session: The session handle to close
Return: Nothing
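The implementation note on orphaned sessions can be sketched without any D-Bus library. The `SessionRegistry` class and the `session-N` handle format are assumptions made for this example; the three arguments of `on_name_owner_changed` mirror the arguments of the standard NameOwnerChanged signal (name, old owner, new owner), which a real server would receive from the bus.

```python
class SessionRegistry:
    """Remembers which bus name created which session handle."""

    def __init__(self):
        self._owner_of = {}  # session handle -> bus name of the caller
        self._next_id = 0

    def new_session(self, caller_bus_name):
        # Hypothetical handle format; the spec only says handles are opaque.
        handle = "session-%d" % self._next_id
        self._next_id += 1
        self._owner_of[handle] = caller_bus_name
        return handle

    def close_session(self, handle):
        if handle not in self._owner_of:
            raise KeyError("closed or unknown session: " + handle)
        del self._owner_of[handle]

    def on_name_owner_changed(self, name, old_owner, new_owner):
        """Drop every session whose creator left the bus; return their handles."""
        if new_owner != "":
            return []  # the name was acquired or handed over, nobody vanished
        orphaned = [h for h, owner in self._owner_of.items() if owner == name]
        for handle in orphaned:
            del self._owner_of[handle]
        return orphaned

    def open_sessions(self):
        return sorted(self._owner_of)
```

A real server would also close the child searches of each orphaned session at the point where this sketch deletes the handle.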
NewSearch (in s session, in s query_xml, out s search)
Create a new search from a query. Returns a handle to the server side search object. If the session handle is closed or does not exist the server must throw an error. Moreover, if there are errors parsing the query_xml parameter an error should be thrown.
Notifications of hits can be obtained by listening to the HitsAdded signal. Signals will not be emitted before a call to StartSearch has been made. Remember to release the search handle with CloseSearch when you are done with it to allow the server to free up resources.
session: A session handle obtained from NewSession
query_xml: A string in the xesam query language representing a well formed xml document
Return: search: An opaque handle for the Search object
StartSearch (in s search)
Start a search created by NewSearch. Remember to connect to hit notification signals before calling this method.
The server should raise an error if the search handle is closed or unknown.
search: The search handle to start a search on
Return: Nothing
GetHitCount (in s search, out u count)
Returns the current number of found hits. This means that when SearchDone is emitted, this method will return the total number of hits on the search.
The server should raise an error if the search handle is closed or unknown. An error will also be thrown if the search has not been started with StartSearch yet.
search: The search handle to obtain a hit count for
Return: count: The number of hits the search engine has found at the time of the method invocation.
GetHits (in s search, in u num, out aav hits)
Get the field data for the next num hits. This call blocks until num hits are available or the index has been fully searched (and SearchDone emitted). The client should keep track of each hit's serial number if it wants to use GetHitData later, as the hit's offset is used as an identifier. See the section below for a discussion about the GetHits return value.
The server will raise an error if the search handle has been closed or is unknown. An error will also be thrown if the search has not been started with StartSearch yet.
search: The handle for the search to retrieve hits for
num: Number of hits to retrieve
Return: hits: An array of field data for each hit as requested via the hit.fields property
GetHitData (in s search, in au hit_ids, in as fields, out aav hit_data)
Get renewed or additional hit metadata. Primarily intended for snippets or modified hits. The hit_ids argument is an array of serial numbers as per hit entries returned by GetHits. The requested properties do not have to be the ones listed in the hit.fields or hit.fields.extended session properties, although this is the recommended behavior.
The server will raise an error if the search handle has been closed or is unknown. An error will also be thrown if the search has not been started with StartSearch yet.
search: The search handle for which to look up data
hit_ids: Array of hit serial numbers for which to retrieve data
fields: The names of the fields to retrieve for the listed hits. It is recommended that this is a subset of the fields listed in hit.fields and hit.fields.extended
Return: hit_data: The requested fields for each hit. See the section about the GetHitData return type below
CloseSearch (in s search)
Close and free a search. Closing your session also closes all searches in that session. The search engine may release any resources related to the search.
The server will raise an error if the search handle has been closed or is unknown.
search: Handle to the search object to free.
Return: Nothing
GetState (out as state_info)
Get information about the status of the search engine. state_info is an array of two strings. The value at position zero is one of IDLE, UPDATE, or FULL_INDEX. The value at position one is a string formatted integer in the range 0-100. In the case of IDLE the value should be ignored, otherwise it represents the percentage of the task that has been completed.
IDLE, the search engine is not doing anything (other than maybe handling other searches)
UPDATE, the index is being updated
FULL_INDEX, a new index is being built from scratch. It should be noted that live search result set updates (when the property search.live == true) are likely to be erratic when the engine is in the FULL_INDEX state.
Return: state_info: An array containing the state information of the search engine.
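A client might decode the state_info array along the following lines. The function name and the choice to return `None` for the progress of an idle engine are assumptions of this sketch, not something the specification prescribes.

```python
VALID_STATES = ("IDLE", "UPDATE", "FULL_INDEX")


def parse_state_info(state_info):
    """Return (state, percent); percent is None when the engine is IDLE."""
    if len(state_info) != 2:
        raise ValueError("state_info must contain exactly two strings")
    state, percent_text = state_info
    if state not in VALID_STATES:
        raise ValueError("unknown state: %r" % (state,))
    if state == "IDLE":
        return state, None  # the second value is to be ignored when IDLE
    percent = int(percent_text)
    if not 0 <= percent <= 100:
        raise ValueError("progress out of range: %d" % percent)
    return state, percent
```

The same function can be applied to the argument of the StateChanged signal, since it carries the same two-string array.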
Hit Ids
A hit is identified by the sequence number in which it was read with GetHits. E.g. the first hit retrieved will have the id 0, the 10th hit retrieved will have the id 9, etc.
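Because the server never sends hit ids along with GetHits results, the client has to do the counting itself. A minimal sketch of such bookkeeping follows; the `HitTracker` name is invented for this example.

```python
class HitTracker:
    """Assigns hit ids to GetHits results in retrieval order."""

    def __init__(self):
        self.hits = []  # the index in this list is the hit id

    def add_batch(self, batch):
        """Record one GetHits result and return the ids given to its hits."""
        first_id = len(self.hits)
        self.hits.extend(batch)
        return list(range(first_id, len(self.hits)))
```

The ids returned by `add_batch` are the values a client would later pass as hit_ids to GetHitData.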
Signals
The signals include the handle for the appropriate search. Language bindings (or direct consumers) can use dbus match rules to filter out irrelevant signals (e.g. from other xesam consumers' searches). The HitsRemoved and HitsModified signals are only expected when search.live == true, but HitsAdded is always used regardless of the search.live state.
Signal HitsAdded (in s search, in u count)
count is the number of hits added
Signal HitsRemoved (in s search, in au hit_ids)
The hit ids in the array no longer match the query
Signal HitsModified (in s search, in au hit_ids)
The documents corresponding to the hit ids in the array have been modified. They may, for example, have been moved, in which case their uri will have changed.
Signal SearchDone (in s search)
The given search has scanned the entire index. For non-live searches this means that no more hits will be available. For a live search this means that all future signals (Hits{Added,Removed,Modified}) will be related to objects that changed in the index.
Signal StateChanged (in as state_info)
When the state as returned by GetState changes the StateChanged signal is fired with an argument as described in said method. If the indexer expects to only enter the UPDATE state for a very brief period, e.g. indexing one changed file, it is not required that the StateChanged signal is fired. The signal only needs to be fired if the process of updating the index is going to be non-negligible. The purpose of this signal is not to provide exact details on the engine, just to provide hints for a user interface.
Session Properties
The values of session properties are expressed as dbus variants - mainly to allow lists of values for a single property. The types allowed for property values are string, integer, boolean, and arrays of said types.
search.live : If true the search will be persistent and notify for changes in the result set via the signals listed above. As noted under the GetState method the live updating might be erratic when the search engine is in the FULL_INDEX state. Type: boolean, default: false.
hit.fields : the metadata fields to return in GetHits. Type: array of strings, default: ["xesam:url"] [1]
hit.fields.extended : An optional hint to the search engine on which additional properties are likely to be requested for most hits. This can e.g. be used to hint that snippets are likely to be requested. Type: array of strings, default: [].
hit.snippet.length : The length (in characters) of the hit's snippet (short excerpt of document content). This is only a hint and not guaranteed to be respected. Type: unsigned int, default: 200
sort.primary : The name of the primary property to sort by. This property must be included in hit.fields. Type: string, default "score".
sort.secondary : If the primary sort properties are equal, sort by this property. As with the primary property this value must be included in hit.fields. Type: string, default: undefined and up to the search engine.
sort.order : Possible values "ascending" and "descending". Type: string, default: "descending".
vendor.id : A string uniquely identifying the search engine. Type: string, default: "Unknown" (read-only)
vendor.version : An unsigned integer (to ease version comparisons) representing the search engine's version number. Converting the standard MajorMajor.MinorMinor.MicroMicro to an integer is left as an exercise to the reader. Type: unsigned int, default: "0" (read-only)
vendor.display : A short string describing the search engine suitable for display. Type: string, default: "Unknown" (read-only)
vendor.xesam : The version number of the xesam specification the vendor implements. The xesam versioning scheme is MajorMajor.MinorMinor. The version of this document is 0.9 which is mapped to the uint 90. Type: unsigned integer, default: 1 (read-only)
vendor.ontology.fields : A list of known/indexed metadata field names. Type: array of strings, default Undefined (read-only)
vendor.ontology.contents : A list of known/indexed metadata content types. Type: array of strings, default Undefined (read-only)
vendor.ontology.sources : A list of known/indexed metadata storage types. Type: array of strings, default Undefined (read-only)
vendor.extensions : A list of supported query extensions. See below. Type: array of strings, default [] (read-only)
vendor.ontologies : Ontologies known by the search engine. See the section below. Type: aas, default ["xesam", "1.0", "$XDG_SYSTEM_DATA_DIR/share/ontologies/xesam-1.0"] (read-only)
vendor.maxhits : The maximum number of hits retrievable from the search engine. This does not limit how many hits can actually be scored on a query, just how many it is possible to retrieve with GetHits. Type: unsigned int, default: Undefined (read-only)
[1]: The exact value of this property depends on the xesam metadata spec which is yet to be finished.
Field names vs Session properties: It is important to understand that metadata field names and session properties are not the same. Generally a metadata field is something that is stored in the search engine's index and a property refers to some state stored with the given Session.
hit.fields Property, GetHits Return Value
The return value of GetHits and GetHitData is a sorted array of hits. A hit consists of an array of fields as requested through the session property hit.fields. Since the signature of the return value is aav, a single hit is of the form av. This allows hit properties to be integers, strings or arrays of any type. An array of strings is needed for email CC fields and keywords/tags, for example. The returned fields are ordered according to hit.fields. E.g. if hit.fields = ["xesam:title", "xesam:userKeywords", "xesam:size"] (field names are defined in XesamOntology90) a return value would look like:
[ ["Desktop Search Survey", ["xesam", "search", "hot stuff"], 54367], ["Gnome Tips and Tricks", ["gnome", "hacking"], 437294] ]
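Since the fields of each hit come back in the order given by hit.fields, a client can pair the two up to get named access. The following sketch assumes the aav value has already been unpacked into plain Python lists by the D-Bus binding; `hits_to_dicts` is a name invented here.

```python
def hits_to_dicts(hit_fields, hits):
    """Turn positional hit rows into dictionaries keyed by field name."""
    result = []
    for row in hits:
        if len(row) != len(hit_fields):
            raise ValueError("hit has %d values, expected %d"
                             % (len(row), len(hit_fields)))
        result.append(dict(zip(hit_fields, row)))
    return result


fields = ["xesam:title", "xesam:userKeywords", "xesam:size"]
hits = [
    ["Desktop Search Survey", ["xesam", "search", "hot stuff"], 54367],
    ["Gnome Tips and Tricks", ["gnome", "hacking"], 437294],
]
named = hits_to_dicts(fields, hits)
```

After this, `named[0]["xesam:title"]` holds the title of the first hit and `named[1]["xesam:userKeywords"]` the keyword list of the second.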
Unset Fields: If a server encounters an unset field it should default to the following values, according to the field data type:
- Boolean: False
- Integer: 0
- String: ""
- Float: 0.0
- Date: ""
It should be noted that the hit return type aav with this notion of unset values has been under much debate. The current form is based on the idea that this specification is primarily targeted for search and not semantically correct metadata storage. So the mantra is easy and simple to use, easy and simple to implement.
Unknown Field Names: If the server gets a request for an unknown field (via GetHitData or through the hit.fields property) it should return an empty string for that field.
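On the server side, the rules for unset and unknown fields might be applied as in the sketch below. The `field_types` dictionary is a hypothetical stand-in for the ontology (field name to data type), and `document` stands for whatever the index has stored for one hit; neither structure is defined by this specification.

```python
# Defaults for unset fields, keyed by the field's data type.
UNSET_DEFAULTS = {
    "boolean": False,
    "integer": 0,
    "string": "",
    "float": 0.0,
    "date": "",
}


def build_hit_row(document, requested_fields, field_types):
    """Build one hit (an 'av' row) in the order of requested_fields."""
    row = []
    for name in requested_fields:
        if name not in field_types:
            row.append("")  # unknown field name: empty string
        elif name in document:
            row.append(document[name])
        else:
            row.append(UNSET_DEFAULTS[field_types[name]])  # known but unset
    return row
```

For example, a document that only has a title stored, queried for its title, size and a field the engine has never heard of, yields the title followed by 0 and an empty string.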
Field Data Types: The data type of the returned dbus variant for each field is partly determined by the ontology. The rule is that the returned data type should convert cleanly under standard conversions to the one prescribed by the ontology. I.e. if the ontology prescribes that the server should return an integer for the xesam:width field then the server can return it as a string, integer or float, e.g. "10", 10, or 10f. Clients should not be harmed by this as most modern toolkits have provisions for doing dynamic type conversions. For example GLib has GValue.transform and Qt has QVariant::convert.
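In a language without a built-in variant conversion facility, a client can do the conversion by hand. The sketch below shows one possible reading of "converts cleanly" for integer fields; the function name and the decision to reject values with a fractional part are assumptions of this example rather than rules from the specification.

```python
def to_integer(value):
    """Accept an int, a float with no fractional part, or a numeric string."""
    if isinstance(value, bool):
        raise TypeError("booleans are not accepted as integers here")
    if isinstance(value, int):
        return value
    if isinstance(value, str):
        try:
            return int(value)
        except ValueError:
            value = float(value)  # e.g. "10.0"; raises ValueError on garbage
    if isinstance(value, float) and value.is_integer():
        return int(value)
    raise ValueError("cannot convert %r to an integer cleanly" % (value,))
```

With this helper the three server answers mentioned above, a string, an integer and a float, all end up as the same Python integer.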
vendor.ontologies Property, Ontology Introspection
The session property vendor.ontologies is used to introspect which ontologies are known by the service vendor. A service owning the bus name org.freedesktop.xesam.searcher must know and respect the default xesam-core ontology.
An ontology reference is a triple of strings (unique_name, version, path) and the type of the session property vendor.ontologies is an array of ontology references, i.e. it has a dbus signature of aas.
An example shared online search service using Yahoo and Google as backends might have the following value for vendor.ontologies:
[
["yahoo", "1.0", "/usr/share/ontologies/yahoo-1.0"],
["google", "1.0", "/usr/share/ontologies/google-1.0"]
]
The values of the ontology triples (unique_name, version, path) deserve description:
unique_name: This is a name that uniquely describes the vendor of the ontology.
version: the version of the ontology
path: the absolute path to the ontology
Ontologies are installed in a directory under {XDG_USER_DATA_DIR,XDG_SYSTEM_DATA_DIR}/ontologies named <unique_name>-<version>.
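The directory layout can be expressed as a pair of small helpers. Both function names are invented for this sketch, and the data directory is passed in explicitly because resolving the XDG variables is outside the scope of this page.

```python
import posixpath


def ontology_dir(data_dir, unique_name, version):
    """Path of an ontology installed under the given data directory."""
    return posixpath.join(data_dir, "ontologies",
                          "%s-%s" % (unique_name, version))


def reference_follows_layout(reference):
    """Check that an ontology reference's path matches its name and version."""
    unique_name, version, path = reference
    expected_leaf = "%s-%s" % (unique_name, version)
    parent = posixpath.basename(posixpath.dirname(path))
    return posixpath.basename(path) == expected_leaf and parent == "ontologies"
```

With `/usr/share` as the data directory, `ontology_dir("/usr/share", "yahoo", "1.0")` gives the same path as the first entry of the example value above.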
FIXME: There should be some kind of metadata for the ontology itself such as a vendor name (the unique name as in the dir-name), ontology version, full vendor name (free form string), ontology description, ontology license. Whether this is stored in a separate file or embedded in the ontology itself (could be done in RDF/XML for example) is another matter to be decided later.
FIXME: We still need consensus on the ontology representation format (RDF vs .ini)
vendor.extensions Property, Query Extensions
The xesam query language supports a number of optional extensions on top of the base language. A search engine supporting regular expression matching and fuzzy string matching should return
["regExp", "fuzzy"]
Simple Use Case
Retrieve a list of URIs matching a query:
session = NewSession()
search = NewSearch (session, query)
StartSearch (search)
hits = GetHits (search, 1000)
CloseSession (session)
Advanced Use Case
A live search doing non-blocking requests and hinting to the search engine that it will retrieve snippets for each hit:
session = NewSession()
SetProperty (session, "search.live", "true")
SetProperty (session, "search.blocking", "false")
SetProperty (session, "hit.fields", ["uri","dc:title"])
SetProperty (session, "hit.fields.extended", ["snippet"])
search = NewSearch (session, query)
<register signal handlers and match rules for the search handle>
StartSearch (search)
if HitsAdded(count):
    GetHits(search, count)
    <update ui>
    GetHitData (search, hit_ids, ["snippet"])
    <update ui with snippets>
else if HitsRemoved (hit_ids):
    <remove all affected hits>
else if HitsModified (hit_ids):
    new_data = GetHitData(search, hit_ids, ["uri", "dc:title", "snippet"])
    <update ui with new data>
...
CloseSearch (search)
...
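The client-side bookkeeping behind the three signal branches above could look roughly like this. The `LiveResults` class is invented for this sketch and contains no D-Bus calls: the caller is expected to fetch rows with GetHits or GetHitData and pass them in. Two assumptions are made that the specification does not spell out: GetHitData returns its rows in the same order as the requested hit_ids, and ids of removed hits are never reused.

```python
class LiveResults:
    """Client-side view of one live search, keyed by hit id."""

    def __init__(self):
        self.rows = {}      # hit id -> list of field values
        self._next_id = 0   # id the next retrieved hit will get

    def on_hits_added(self, new_rows):
        """Store rows just fetched with GetHits; ids follow retrieval order."""
        for row in new_rows:
            self.rows[self._next_id] = row
            self._next_id += 1

    def on_hits_removed(self, hit_ids):
        """Forget hits that no longer match the query."""
        for hit_id in hit_ids:
            self.rows.pop(hit_id, None)

    def on_hits_modified(self, hit_ids, new_rows):
        """Replace stored rows with fresh GetHitData results for hit_ids."""
        for hit_id, row in zip(hit_ids, new_rows):
            if hit_id in self.rows:  # ignore hits we already dropped
                self.rows[hit_id] = row
```

A user interface would then redraw itself from `rows` after each handler has run.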
Resources
xesam-tools - Command line tool to search xesam services. It is implemented with GObjects in Python. In addition to the command line tool it contains a nice stand-alone xesam module for PyGObject.
bzr repository: http://grillbar.org/xesam-tools
