Blogging was where we began, and how we built our company so we have preserved this archive to show how our thinking developed over a decade of developing the use of social technology inside organisations

TQL : A Standard Syntax For Multi-Tag Queries

by

Introduction

Tagging is booming. Everywhere you look, it seems that these little free-text Post-Its have become the semantic glue of Web 2.0.

But therein lies the problem – the tech-heavy community of del.icio.us users tends to use the tag "java" to refer to the programming language Java, as is immediately apparent from the link above. The same tag on Flickr, however, shows lots of photos of the island of Java, and none relating to the programming language.

This is to some extent inescapable – a single tag is always context-sensitive, and taken out of that context, the inherent ambiguities of language become apparent.

These ambiguities can be resolved, or at least reduced, by using multiple tags to provide context. For instance, anything tagged with Java AND code is probably about the programming language, whereas something tagged with Java AND Indonesia probably refers to the place.

Or maybe not – according to Wikipedia, Java is also a type of coffee which originated on the island, and it’s also a term for the Javanese language. The combination of java AND indonesia could plausibly be used for both of these subjects aswell. java AND coffee should drill down to the coffee-related posts, but java AND language ? Well, which language? The programming language or the spoken language?

So maybe we need a third tag? Something like java AND language AND javanese. But then we are relying on any item about the Javanese language being tagged with all three tags. This may not be true in many cases – if a post is tagged with just java AND javanese, our three-tag query above will not return it. So how about java AND language OR javanese? But then we must be careful – does that mean that it must have java AND EITHER language-or-java, or does it mean EITHER java-AND-language OR javanese ?

There are some tools and APIs already starting to emerge which support combinations of tags – Del.icio.us supports tag “unions” via the + operator. Ultimate Tag Warrior, the tag plugin for WordPress, supports tag “unions” and “intersections” via the + and | operators. Technorati, on the other hand, uses a plain-text string of "tag1 OR tag2" in the URL, which gets URL-encoded to "tag1%20OR%20tag2".

So which is it to be?

A Proposed Solution

What is needed, therefore, is a generalised boolean syntax for tag URLs. A Tag Query Language, if you like. It must have certain properties :

Properties

  1. It should be able to be used in a URL – so no “/” characters, no “.” characters, etc.
  2. It should be human-interpretable – a person should be able to look at a TQL query string and work out what was intended without too much effort.
  3. It should be simple – Web 2.0 is showing us that the simple solution is usually the best, and something which requires too much time to implement is just not going to be widely adopted.
  4. It should be easily translatable into standard SQL
  5. It must not expose the implementing website to potential SQL Injection attacks

So what form would such a syntax have? Maybe something like the following:

Operators

  • A boolean AND is indicated by the plus sign +
  • A boolean OR is indicated by the pipe symbol |
  • A boolean NOT is indicated by the exclamation mark !
  • Ambiguous logic (e.g. “x and y or z” ) can be resolved by grouping with braces ( )

Rules

  • To satisfy requirement 5, any character which is not an alphanumeric or one of the above qualifiers should be stripped out.
  • To satisfy requirement 3 – the KISS principle –
    • braces must not be nested. That way lies a whole world of nastiness…
    • all operators have equal precedence after grouping with braces (). In the absence of braces, they should be evaluated in straightforward left-to-right order.

To use our example above, this would allow a query for items relating to java, the spoken language to be written as follows:

java+(language|javanese)

Example SQL

Assume that :

  1. tags are stored in a column tagcolumn in a table tagtable
  2. the items we want to return are stored in a table itemtable
  3. items are related to tags in a many-to-many : one item can have many tags, and a tag can apply to many items.
  4. this link is accomplished by an intermediary linktable relating itemids to tagids
  5. we can represent the selection criteria for an item by the placeholder myothercriteria

Then our query above would translate to something like the following:

SELECT (columns)FROM (itemtable)WHERE  EXISTS ( 			SELECT tagcolumn FROM tagTable INNER JOIN linktable ON linkcriteria 		WHERE tagcolumn = 'java' AND linktable.itemid = itemtable.itemid 	)AND (	EXISTS ( SELECT tagcolumn FROM tagTable INNER JOIN linktable ON linkcriteria 		WHERE tagcolumn = 'language' AND linktable.itemid = itemtable.itemid 		)  OR	EXISTS ( SELECT tagcolumn FROM tagTable INNER JOIN linktable ON linkcriteria		WHERE tagcolumn = 'javanese' AND linktable.itemid = itemtable.itemid  		))

Of course, this SQL is only an example, and is written for clarity rather than performance. There are many optimisations which could be done in this scenario, and many platform-specific options to be considered. For instance:

  • SQL Server allows the indexing of views, which could reap large benefits in this application.
  • Coldfusion allows the re-querying of in-memory resultsets using SQL syntax.

However, these optimisations are implementation-specific, and as such are beyond the scope of this article.

5 Responses to TQL : A Standard Syntax For Multi-Tag Queries

  1. By Chris L on February 5, 2006 at 3:14 am

    I was sure that I’d seen this before, right down to the TQL name– I’ve mentioned it in passing hundreds of times over the last year or so. But I guess I was just imagining something that fills an obvious need…
    It sure would be fantastic if a standard like this were picked up and used, even if only in addition to other search types and/or through the API (meaning, I could understand Technorati retaining x OR y, but still making a standards-based query available)…

  2. By Al Davidson on February 6, 2006 at 9:56 am

    Chris,
    Thanks for the comment – the term TQL has indeed been used before for different things, it was more of a suggestion than a definitive claim to the name…after all, AJAX has shown that you have to call it _something_, otherwise no-one will notice, right? 🙂
    But that was exactly the kind of implementation I had in mind – as a complement for existing formats. Much as people maximise interoperability by make feeds available in RSS / RDF / Atom formats, this would be another way to achieve the same goal.

  3. By Daniel O'Connor on February 14, 2006 at 1:05 pm

    See:
    What do HTTP URIs Identify
    http://lists.w3.org/Archives/Public/www-tag/2002Jul/0319
    Introduction to the Semantic Web with N3 notation
    http://www.w3.org/2000/10/swap/Primer
    What is RDF
    http://www.xml.com/pub/a/2001/01/24/rdf.html
    SPARQL tutorial
    http://jena.sourceforge.net/ARQ/Tutorial/
    … and try not to reinvent any wheels…

  4. By Al Davidson on February 14, 2006 at 1:38 pm

    Daniel
    Thanks for the pointers – from quick scan of them, it appears that we’re getting at sligthly different things here. The RDF and SPARQL articles are referring to structured data, where “things” have defined, distinct properties, and the SPARQL is for retrieving “things” by known property names. Conversely, both the power and the main drawback of social tagging appears to be that a tag is formless metadata – it could refer to any property or means of describing the “thing”.
    Fair point about not reinventing the wheel – I’m all for standardisation – this piece started from a middle-of-the-night observation that two of the standard-bearers of Web 2.0 (delicious and technorati) are already providing multi-tag URL APIs in different syntaxes, and wouldn’t it be a good thing if people agreed to provide these APIs in at least one common way.

  5. By Stephanie Booth on February 20, 2006 at 1:10 pm

    Don’t have time to read it all now (have to rush back to school to teach!) but I did a bit of thinking along those lines last year:
    http://climbtothestars.org/archives/2005/01/16/thinking-about-tags/
    I’ll come back 🙂