Relevance Ranking and OPAC Records

Caution to FRL readers: this gets a little geeky. If your eyes glaze over after the second sentence, just skip it.

A year ago in Techsource I wrote a series about the problems with OPACs, and in the course of it wrote about relevance ranking. I said, quite accurately, that TF/IDF was a technology used for relevance ranking, and if I say so myself, I explained TF/IDF pretty darn well:

TF, for term frequency, measures the importance of the term in the item you’re retrieving, whether you’re searching a full-text book in Google or a catalog record. The more the term million shows up in the document—think of a catalog record for the book Million Little Pieces—the more important the term million is to the document.

IDF, for inverse document frequency, measures the importance of the word in the database you’re searching. The fewer times the term million shows up in the entire database, the more important, or unique, it is.
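For the geekier among you, here is a minimal sketch of that arithmetic in Python. The three toy "records" and all the function names are made up for illustration, and the IDF formula is the plain textbook log(N/df) variant (counting how many records contain the term), not what any particular ILS or search engine actually uses.

```python
import math
from collections import Counter

# Hypothetical toy "catalog records" (title words only), invented for illustration.
records = {
    "rec1": "a million little pieces",
    "rec2": "little women",
    "rec3": "a little princess",
}

def tokenize(text):
    """Lowercase and split on whitespace; real systems do far more."""
    return text.lower().split()

def term_frequency(term, text):
    """TF: share of the record's words that are this term."""
    tokens = tokenize(text)
    return Counter(tokens)[term] / len(tokens)

def inverse_document_frequency(term, docs):
    """IDF: log(N / df), where df is the number of records containing the term."""
    containing = sum(1 for text in docs.values() if term in tokenize(text))
    if containing == 0:
        return 0.0
    return math.log(len(docs) / containing)

def tf_idf(term, doc_id, docs):
    """Multiply the two: high when the term is frequent here and rare elsewhere."""
    return term_frequency(term, docs[doc_id]) * inverse_document_frequency(term, docs)
```

In this toy set, "little" appears in every record, so its IDF (and therefore its TF/IDF score) is zero everywhere, while "million" scores above zero only for the record that contains it. Notice how little text each record offers to count, which hints at why short citation records can be awkward material for this kind of math.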

However, if I could revisit that article today, I would emphasize that “ranking,” done well in the Most Moderne Fashion, is a complex soup dependent on far more than TF/IDF, and I would include the opinions of people who believe — in some cases, based on real-world testing — that TF/IDF doesn’t work that well for ILS record sets, due in part to the inherent nature of citation records, which unlike full text are…hmmm… I am not sure how to put it, but citation records aren’t mini-representations of full text; they are a unique form of data. It can be dismaying to turn on relevance ranking in an ILS and discover that your results are close to nonsense — though it’s worth asking how much more nonsensical than OPACs that order items “last in first out.”

(At My Former Place Of Work Minus One, we tested a fancy search engine that had this really really kewl method for producing dynamic facets… why is it when vendors say “dynamic,” I start to twitch? Anyway, we turned it on and had to laugh. The kewl technology was based on word pairings that made some sense in full text, but for a citation index created nonsensical facets from phrases such as “includes bibliography.”)

Furthermore, record sets are not consistent within themselves. Cataloging practices change over time; practices even differ among catalogers or between formats. (Oh oh… did I just reveal a little secret?)

However, given that we don’t have full-text for records in most cases (though yes, I do think it would be great if we had that content to leverage), and given that TF/IDF is a useful technology, I would hazard — a good word in this case, since I don’t have any way to prove this — that some of the weaknesses of TF/IDF and ILS record sets could be at least partially remediated if record sets were broken up (or perhaps, marked up) along these lines of difference.

Most current-generation search engines (the Siderean/FAST/Endeca/i411/Dieselpoint set, to name five similar products; and I suspect Aquabrowser belongs here as well) can, unless I am greatly mistaken, provide a variety of ranking optimizations. That’s a huge plus if you’re streaming together radically different indices. So an ILS record set could be broken up (not necessarily literally) along chronological and format lines, and each section indexed for its optimal needs, and then reassembled within the index. That, plus including other ranking methods such as popularity, could go a long way. (I can imagine that it would be possible to do some processing on the fly, but that sounds expensive CPU-wise. There’s a reason we all hate federated search…)
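To make that hunch a little more concrete, here is a rough sketch, in Python, of what "mark up the records by era and format, weight each segment differently, then rank everything together" might look like. Everything in it is an assumption for illustration: the `Record` fields, the 1980 cutoff, the weights, and the popularity boost are invented, and this is not how any of the products named above actually work.

```python
from dataclasses import dataclass

@dataclass
class Record:
    title: str
    subjects: str
    notes: str
    fmt: str           # e.g. "book", "ebook"
    year: int
    popularity: float  # assumed to be normalized to the range 0..1 already

# Hypothetical per-segment field weights; a real system would tune these by testing.
SEGMENT_WEIGHTS = {
    ("book", "pre-1980"): {"title": 3.0, "subjects": 1.0, "notes": 0.2},
    ("book", "1980-on"): {"title": 2.0, "subjects": 2.0, "notes": 0.2},
    ("ebook", "1980-on"): {"title": 2.0, "subjects": 1.5, "notes": 0.5},
}
DEFAULT_WEIGHTS = {"title": 1.0, "subjects": 1.0, "notes": 1.0}

def segment_of(record):
    """Mark up a record by format and era (the 1980 cutoff is arbitrary)."""
    era = "pre-1980" if record.year < 1980 else "1980-on"
    return (record.fmt, era)

def field_hits(term, text):
    """Count whole-word matches of the term in one field."""
    return text.lower().split().count(term.lower())

def score(term, record, popularity_weight=0.5):
    """Weighted field matches for the record's segment, boosted by popularity."""
    weights = SEGMENT_WEIGHTS.get(segment_of(record), DEFAULT_WEIGHTS)
    text_score = sum(
        weight * field_hits(term, getattr(record, field))
        for field, weight in weights.items()
    )
    if text_score == 0:
        return 0.0  # popularity alone never makes a non-matching record appear
    return text_score * (1.0 + popularity_weight * record.popularity)

def rank(term, records):
    """Score every record with its own segment's weights, then merge into one list."""
    scored = [(score(term, r), r) for r in records]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [r for s, r in ranked if s > 0]
```

The point of the design is that the weighting happens once per segment rather than per query, so a note like “includes bibliography” can be played down across the board without any expensive processing on the fly.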

(Speaking of popularity… a little caution, also based on experience: if you’re incorporating a “popularity” function in your OPAC, and you’re basing it on circulation… and you have a large e-book collection… see where I’m going? E-books don’t circ, so they don’t even get included in that metric. You may want to change that label to “most checked out,” change its value to represent “most browsed,” or skip it entirely.)
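Here is a tiny sketch of that trap in Python. The item list, the counts, and the field names (`checkouts`, `views`) are all invented for illustration; your ILS will call these things something else, if it tracks them at all.

```python
# Hypothetical usage counts, invented for illustration.
items = [
    {"title": "Print bestseller", "format": "book", "checkouts": 120, "views": 300},
    {"title": "Popular e-book", "format": "ebook", "checkouts": 0, "views": 900},
    {"title": "Quiet print title", "format": "book", "checkouts": 3, "views": 10},
]

def most_checked_out(items):
    """Circulation-based list: anything that never circulates silently drops out."""
    circulating = [i for i in items if i["checkouts"] > 0]
    return sorted(circulating, key=lambda i: i["checkouts"], reverse=True)

def most_browsed(items):
    """View-based list: every format with a record page can compete."""
    return sorted(items, key=lambda i: i["views"], reverse=True)
```

With these made-up numbers, the e-book never shows up in the circulation-based list at all, yet it tops the view-based one, which is why the label on the list should match the metric behind it.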

Of course, it would help if we didn’t have something quite as odd as a “record” to contend with… how slavish to convention, that a Thing has a Record… but that’s a whole ‘nother line of thought I have pondered since working with record sets a year or so ago.

Oh, and if you’re looking at Google Appliance for improving search functionality in your catalog… why? Google’s approach is almost vehemently anti-metadatical (is it redundant to say “that’s a new neologism”?); they worship at the cult of full text. Library data may be expensive to produce and annoying to work with, but you might as well look at tools that leverage all that structure.

Gee, that was fun to write! Geek out. I may move on to “why SRW depresses me…”

Best practices for managing virtual workers

Yes, it’s another opportunity to weigh in on interesting issues *and* get you or your organization mentioned in an online journal read by IT managers in and out of LibraryLand!

For an upcoming article, I’m writing about best practices for managing/supervising virtual employees or contractors (sometimes known as satellite workers). Before you say “but we don’t do that,” consider that if you have ANY telecommuting activity in your organization — for example, working from home every other Friday, working the occasional day from home due to emergencies, or even authorizing an occasional case of what I call “report-writing flu” — then you are managing virtual workers.

Likewise, if you are the tethered IT person who occasionally finds herself crouched in a conference hallway thousands of miles from the library, negotiating a server restart, then you’re a virtual worker, too. How do you manage *yourself*?

Some possible questions: how do you communicate expectations to these workers? How do you ensure the work is done? What tools do you use to communicate with distance workers (for example, instant messaging, Skype, email)? What has worked best (and maybe, not so well)? Has virtual employment changed workplace expectations of what ‘work’ is about, making it more outcome-based? Do you need to set boundaries so workaholics don’t grind themselves into the ground?

Input due by COB (however and wherever you define “close of business”!) this Friday, July 27… thanks!

My Techsource Post about Dewey

I have to say, when I hit “publish” for my Techsource post about post-Deweyfication last night, I had no idea it would have eight comments by this morning. I attribute that to Jessamyn’s link love, and thanks, gal.

As Dorothea over at Caveat Lector notes, the Wall Street Journal’s coverage of the post-Deweyfication of the Perry Branch at the Maricopa County Library District was spot-on. I always feel a little weird about praising an article that quotes me… “My name is Karen, and I approve this message.” Jessamyn is so much better about this… I know none of you think of me as shy, but I need to learn from her.

My Techsource post covers two parallel innovations: the Perry project, and the use of bookstore headings (BISAC) in an online catalog at the Phoenix Public Library. (I started to write “BISAC in an OPAC,” then thought, get thee away, acronyms!) I find it a reflection of the feudal nature of libraries that two systems in the same county could be working in parallel on eerily similar projects and not know it.

Writing for the Web

An all-day writing workshop for the Panhandle Library Access Network. We’re going to have fun!

2.0 at Williamsburg Regional Library

A staff presentation, and a consultation. This should be fun — and such a beautiful area!

Death to Jargon

One-hour OPAL presentation (online) for a system in Wisconsin.

Symposium on the Future of the ILS

Lincoln Trails Library System, Champaign, IL (hello, alma mater!). I’m giving a talk and then I’m talking to trustees on Saturday. I hope I can see GSLIS… and a couple of friends in the area!

FRL’s Blogiversary: Today We Are Four!

Today Free Range Librarian turns four years old!

Now, today is the official blogiversary: the day I first put a post into a Movable Type blog I had installed myself. I entered three posts in July 2003 (two of which I had written and previously posted elsewhere), but I didn’t blog again until November 2003, which has 36 posts and which therefore, with this post (quoting Ted Hughes, no less), really marks the beginning of my blogging… uh? Career? Lifestyle? (Or maybe I can’t help it… I was just born that way?)

In reviewing the posts for the last four years, it struck me that the months where I posted the most were not the months where I was the least busy — in fact, they were often when I was extremely busy — but were the months when I was happiest. I’m a talker (on paper, anyway), and chattering away is a sign I’m happy.  Also, writing breeds writing: the more I write, the more my mind is in the writing mode, and the more inclined I am to set words down and share them with Gentle Readers from many walks of life.

Anyway, thanks for sharing yourselves with me these past four years. (It’s become so much easier to manage comments since I moved to WordPress… oy, night and day!)  Your readership has been a nice constant in my life.

Not a creature was stirring…

Someone asked why PUBLIB (the discussion list for public librarians) was so quiet this weekend. A few people tore themselves away from HP7 (as some refer to the last Harry Potter book) to say, in essence, “Dude, we’re reading.”

As of Sunday, I was number 231 on the library reserve list, which is entirely my fault. For the last three months I’ve been telling myself, “Next year in Jerusalem!” I have a long, long list of stuff on my “git list” for when I’m working on a regular basis again. Harry Potter is on that list, but I could have reserved my library copy much, much earlier, rather than waiting until Sunday.

Also on my list: a dining-room table; a tiny house; a Mini-Cooper; a squirrel-proof cardinal feeder; a facial; a trip to New York; any number of books that are quite available at the library; a small flat-screen television for the sun room, though only our tabby cat uses the sun room on a regular basis… In other words, nothing I really need (we have a table pressed into service in the dining room, in case you’re wondering, and the real reason we don’t have a new table has more to do with having left the land of plenty and then realizing that we and the South have different opinions about furniture, much as we do about Mexican or Chinese cuisine).

The “git list” is in part a relief valve. I crave something, I put it on the list, and the craving goes away. But the “git list” also functions as a gentle reminder that my professional life is up in the air. I have some possibilities (in the tradition of The Veil, you will never hear about the ones I don’t get); I’ve also had some “sure bets” fall through; I’ve had some nice surprises come my way. I also have some contracting work up in the air (thank you to those who could be understanding this way) as I wait for resolution.

Maybe I didn’t put myself on the list for Harry Potter earlier because it felt too much like admitting that resolution was not at hand; and conversely, maybe putting HP7 on reserve Sunday morning (as I hacked and sneezed through a sudden summer cold) was important for reminding myself that we know how to live on a shoestring. Sandy was unemployed for our first full year in California, and even in that overpriced state, we survived.

We planned that I would be unemployed when we relocated to Florida; we just didn’t know the sequence of events would be “work, unemployment,” versus “unemployment, work.” I learned quite a bit about myself this past year — not the least of which is how important it is to plan for contingencies in the first place. But another lesson is the importance of patience and hope. I have to believe things will happen for me… what else is there to go on?

ASERL Presentation: Reclaiming the Heartland

’tis a bit oblique without my voiceover, but gives you some idea of what I discussed at ASERL’s “Age of Discovery” conference yesterday. (For example, it doesn’t include my explanation that after Dewey and Cutter had a falling-out, Dewey got the big numbers and Cutter got the small ones.)