An old colleague of mine has persuaded me to release an implementation of an "inverted index"-based search library, written solely as MySQL Stored Procedures. Our combined work is now available on Google Code.
It's been a long time since my last post... I've been spending most of my time recovering (still!) and working on a new game project for some old friends. On the way, I've learned ActionScript 3, Flash, Papervision3D, Box2DFlash, and a whole slew of other fun stuff. As soon as the project sees the light of day, I'll be announcing it here.
I've also done a bit of consultancy work on the way which catalysed the stored procedure work I'm releasing here. As I mentioned in a previous post, I had an article published in php|architect Magazine in which I presented some simple code to do database searches in PHP.
The basic concept of this method is to index content in a database table by separating it into words and storing the locations of those words in a separate table. This is about the most basic form of search engine, other than the "wildcard search" that many database developers seem to use, unfortunately. Wildcard searches are incredibly inefficient and unscalable. The other alternative is the MySQL (and MyISAM) specific FULLTEXT approach, which I've never been particularly happy with.
The Inverted Index technique is nothing new, and certainly not rocket science. However, I find that many (esp. younger) programmers haven't heard of this method, and rely on external libraries, specific hacks (such as FULLTEXT), or usually the dreaded "string LIKE '%foo%'" construction which can stop a MySQL server in its tracks.
It's no substitute for a proper search engine library, such as Apache Lucene, but it does have the benefit of being easily integrated into other queries. The problem of searching documents is one thing, but sometimes you just need to search an address field, biography field, or something like that, while still searching on other columns as well. While some external libraries will allow integration with MySQL through UDFs, that adds a whole extra maintenance load and is also not usually possible on shared database servers.
The approach is fairly easy to implement in an application language such as Perl or PHP. I've been doing it for years. However, it's still involved quite a lot of setup and maintenance.
Instead, I've written an implementation that does everything within the database itself using triggers and stored procedures.
This allows the programmer to treat the data table as a simple table, and then call a single stored procedure to perform a search. More complicated queries can be constructed using the same data, but still keeping the same triggers in place to do the dirty work.
While this approach might not be the most efficient way of doing things, I think it's the most self-contained and simplest to use. It should work on all MySQL storage engines, and I imagine it could be adapted to run on other RDBMSes altogether.
Anyway, I mentioned the code I'd written to my old colleague, Stig Palmquist, and he felt that it was valuable work that could do with being released as an open-source project. Moreover, he was eager to contribute to the effort. So, it wasn't long before we started a Google Code project, and we've both significantly improved the code since. It's open-sourced under the Apache License 2.0.
We'd both value any comments and contributions you may have.