Caution to FRL readers: this gets a little geeky. If your eyes glaze over after the second sentence, just skip it.
A year ago in Techsource I wrote a series about the problems with OPACs, and in the course of it wrote about relevance ranking. I said, quite accurately, that TF/IDF was a technology used for relevance ranking, and if I say so myself, I explained TF/IDF pretty darn well:
TF, for term frequency, measures the importance of the term in the item you’re retrieving, whether you’re searching a full-text book in Google or a catalog record. The more the term million shows up in the document—think of a catalog record for the book Million Little Pieces—the more important the term million is to the document.
IDF, for inverse document frequency, measures the importance of the word in the database you’re searching. The fewer times the term million shows up in the entire database, the more important, or unique, it is.
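Just to make the arithmetic concrete, here's a minimal Python sketch of that calculation. The three toy "records," the whitespace tokenization, and the bare-bones log(N/df) formula are all invented for illustration; real engines vary the details (smoothing, length normalization, and so on) considerably.

```python
import math

# A toy "database" of three catalog-record-like strings, invented for illustration.
records = {
    "rec1": "a million little pieces frey james memoir bestseller",
    "rec2": "the kite runner hosseini khaled fiction bestseller",
    "rec3": "freakonomics levitt dubner economics bestseller",
}

def tf(term, record_text):
    """Term frequency: how often the term appears in this record, relative to its length."""
    words = record_text.split()
    return words.count(term) / len(words)

def idf(term, all_records):
    """Inverse document frequency: log(N / df); the rarer the term across the database, the higher the weight."""
    df = sum(1 for text in all_records.values() if term in text.split())
    if df == 0:
        return 0.0
    return math.log(len(all_records) / df)

def tf_idf(term, record_id, all_records):
    return tf(term, all_records[record_id]) * idf(term, all_records)

print(round(tf_idf("million", "rec1", records), 3))      # rare across the database, so it gets a real score
print(round(tf_idf("bestseller", "rec1", records), 3))   # appears in every record, so it scores 0.0
```

In this tiny example "million" earns its weight because it is both present in the record and scarce in the database, while a term that shows up everywhere contributes nothing.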
However, if I could revisit that article today, I would emphasize that "ranking," done well in the Most Moderne Fashion, is a complex soup dependent on far more than TF/IDF. I would also include the opinions of people who believe, in some cases based on real-world testing, that TF/IDF doesn't work all that well for ILS record sets, due in part to the inherent nature of citation records, which, unlike full text, are… hmmm… hard to characterize: citation records aren't mini-representations of full text; they are a unique form of data. It can be dismaying to turn on relevance ranking in an ILS and discover that your results are close to nonsense, though it's worth asking how much more nonsensical that is than OPACs that order items "last in first out."
(At My Former Place Of Work Minus One, we tested a fancy search engine that had this really, really kewl method for producing dynamic facets… why is it that when vendors say "dynamic," I start to twitch? Anyway, we turned it on and had to laugh. The kewl technology was based on word pairings that made some sense in full text, but against a citation index it created nonsensical facets from phrases such as "includes bibliography.")
Furthermore, record sets are not consistent within themselves. Cataloging practices change over time; practices even differ among catalogers or between formats. (Oh oh… did I just reveal a little secret?)
However, given that we don't have full text for most records (though yes, I do think it would be great if we had that content to leverage), and given that TF/IDF is a useful technology, I would hazard (a good word in this case, since I don't have any way to prove this) that some of the weaknesses of TF/IDF with ILS record sets could be at least partially remediated if record sets were broken up, or perhaps marked up, along these lines of difference.
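To make "broken up, or perhaps marked up, along these lines of difference" slightly less hand-wavy, here's a hypothetical sketch: each record carries a segment tag (cataloging era plus format, say), and document frequency is computed within the segment rather than across the whole heterogeneous set. The segment labels, fields, and toy text are invented, not drawn from any particular ILS.

```python
import math
from collections import defaultdict

# Hypothetical records, each tagged with a "segment" (cataloging era + format).
# Segment names, fields, and text are invented for illustration.
records = [
    {"id": 1, "segment": "older-book",  "text": "million little pieces frey"},
    {"id": 2, "segment": "older-book",  "text": "a million years of evolution"},
    {"id": 3, "segment": "older-book",  "text": "history of the english language"},
    {"id": 4, "segment": "newer-ebook", "text": "million little pieces electronic resource"},
    {"id": 5, "segment": "newer-ebook", "text": "economics in a digital age"},
    {"id": 6, "segment": "newer-ebook", "text": "introduction to statistics"},
]

# Group records by segment so document frequency reflects that slice of the
# catalog, not the whole mixed set.
by_segment = defaultdict(list)
for rec in records:
    by_segment[rec["segment"]].append(rec)

def segment_idf(term, segment):
    docs = by_segment[segment]
    df = sum(1 for r in docs if term in r["text"].split())
    return math.log(len(docs) / df) if df else 0.0

# The same term ends up with a different weight in each slice.
print(segment_idf("million", "older-book"))   # common in this slice, so it counts for less
print(segment_idf("million", "newer-ebook"))  # rarer in this slice, so it counts for more
```

The "marked up" variant of the same idea would keep one physical index and simply scope the statistics by that tag; the arithmetic comes out the same.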
Most current-generation search engines (the Siderean/FAST/Endeca/i411/Dieselpoint set, to name five similar products; and I suspect Aquabrowser belongs here as well) can, unless I am greatly mistaken, provide a variety of ranking optimizations. That's a huge plus if you're stitching together radically different indices. So an ILS record set could be broken up (not necessarily literally) along chronological and format lines, each section indexed for its optimal needs, and the whole thing then reassembled within the index. That, plus folding in other ranking methods, such as popularity, could go a long way. (I can imagine that it would be possible to do some processing on the fly, but that sounds expensive CPU-wise. There's a reason we all hate federated search…)
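Here's a back-of-the-envelope sketch of what reassembling those differently tuned slices, plus a popularity signal, might look like: each hit arrives with a relevance score from its own slice, already normalized to 0 to 1, and gets blended with other signals by fixed weights. The weights, field names, and numbers are all made up; real engines do this far more elaborately (and tunably).

```python
# Toy blend of ranking signals; the weights are invented, not tuned against anything.
WEIGHTS = {"relevance": 0.7, "popularity": 0.2, "recency": 0.1}

def blended_score(relevance, popularity, recency):
    """Combine signals (each assumed to be scaled to the 0..1 range) into one rank score."""
    return (WEIGHTS["relevance"] * relevance
            + WEIGHTS["popularity"] * popularity
            + WEIGHTS["recency"] * recency)

# Two hypothetical hits, one from the print slice and one from the e-book slice.
hits = [
    {"title": "A Million Little Pieces (print)", "relevance": 0.82, "popularity": 0.95, "recency": 0.40},
    {"title": "A Million Little Pieces (ebook)", "relevance": 0.88, "popularity": 0.10, "recency": 0.90},
]

for hit in hits:
    hit["score"] = blended_score(hit["relevance"], hit["popularity"], hit["recency"])

# Reassemble into a single ranked list, regardless of which slice each hit came from.
for hit in sorted(hits, key=lambda h: h["score"], reverse=True):
    print(f'{hit["score"]:.2f}  {hit["title"]}')
```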
(Speaking of popularity… a little caution, also based on experience: if you're incorporating a "popularity" function in your OPAC, and you're basing it on circulation… and you have a large e-book collection… see where I'm going? E-books don't circ, so they never even show up in that metric. You may want to relabel it "most checked out," broaden the underlying measure so it captures e-book use (something closer to "most browsed"), or skip it entirely.)
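If you do keep a usage signal, one way to hedge the e-book problem is to read from whichever counter a given format actually generates. The field names below (circ_count, online_accesses) are invented for the sketch, not taken from any ILS.

```python
# Hypothetical item data; field names are invented, not from any particular ILS.
items = [
    {"title": "Print bestseller", "format": "book",  "circ_count": 212, "online_accesses": None},
    {"title": "Popular e-book",   "format": "ebook", "circ_count": 0,   "online_accesses": 540},
]

def usage_signal(item):
    """Use circulation for physical items and online accesses for e-resources,
    so e-books aren't silently scored as unpopular just because they never circ."""
    if item["format"] == "ebook":
        return item["online_accesses"] or 0
    return item["circ_count"]

for item in items:
    print(item["title"], usage_signal(item))
```

Whether you then relabel the column "most used" is a display decision, but at least the metric stops penalizing a format for how it happens to be delivered.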
Of course, it would help if we didn’t have something quite as odd as a “record” to contend with… how slavish to convention, that a Thing has a Record… but that’s a whole ‘nother line of thought I have pondered since working with record sets a year or so ago.
Oh, and if you're looking at the Google Appliance to improve search functionality in your catalog… why? Google's approach is almost vehemently anti-metadatical (is it redundant to say "that's a new neologism"?); they worship at the cult of full text. Library data may be expensive to produce and annoying to work with, but you might as well look at tools that leverage all that structure.
Gee, that was fun to write! Geek out. I may move on to “why SRW depresses me…”