This is more mutterings about search engines and improving MPOW, inspired by a day of grant-writing. (The grant is due at the fiscal agent next Tuesday.)
Some of you newer to this blog may have missed my fevered descriptions of how to improve search in a content-sparse metadata database such as MPOW. As I mapped out how my mad scheme for improving search in MPOW would work, I wanted to address Genny’s concerns about processing overhead.
Take a search in MPOW for chocolate cake. Pass the search through MPOW, then conduct the search in the wild (as in, er, you know, Google). O.k., I know, 3.3 million hits! But then take the first, oh, I don’t know, 200 matches, and match MPOW results against these. This is where it gets fruit-juicy good, because now the user typing chocolate cake in MPOW retrieves fabulous websites such as Recipe Source. That’s really what the user expects from MPOW: put in a simple query, get back sites matching that query, but websites they can trust.
Now, back in MPOW, process results in this order: MPOW hits; hits representing MPOW/Wild Web matches (deduped); then draw a line and present “continue searching outside MPOW.” Below the very top hits from MPOW would be the general collections we are known for.
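To make the overhead question a little more concrete, here is a minimal sketch in Python of how that ordering might work. Everything in it is a stand-in: mpow_search, mpow_lookup_url, web_search, the "url" field, and the 200-result cutoff are my placeholders, not anything MPOW or the Google API actually exposes.

```python
def blended_results(query, mpow_search, mpow_lookup_url, web_search, web_limit=200):
    """Sketch of the proposed ordering: MPOW hits, then MPOW/Wild Web
    matches (deduped), then a line with 'continue searching outside MPOW'."""
    mpow_hits = mpow_search(query)              # tier 1: direct MPOW metadata hits
    web_hits = web_search(query)[:web_limit]    # first ~200 wild-Web results

    seen = {hit["url"] for hit in mpow_hits}
    matches = []
    for web_hit in web_hits:
        record = mpow_lookup_url(web_hit["url"])    # is this Web result in MPOW's catalog?
        if record and record["url"] not in seen:    # dedupe against tier 1 (and itself)
            matches.append(record)
            seen.add(record["url"])

    return {
        "mpow_hits": mpow_hits,                 # MPOW hits
        "mpow_web_matches": matches,            # MPOW/Wild Web matches (deduped)
        "below_the_line": "Continue searching outside MPOW",
    }
```

The expensive part is the loop: one catalog lookup per wild-Web result, which is exactly where Genny's processing-overhead concern bites.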
For second-tier MPOW results, show the entire site, or specific items, or a combination? Maybe a link to “your match here?” I’m not sure; that sounds like it needs lots of thinking and usability testing. What I am clear about is that something along these lines would bring the user search experience from 1992 to 2005 and beyond. I also think this is far more feasible and dynamic than caching content or similar predictive schemes. You just don’t know when the pope is going to die or when a tsunami is going to hit.
It’s not making a 100% match against the Web, and it would show dupes below the line unless someone has a bright idea for deduping the Wild Web. But it’s gotta be a vast improvement for MPOW searchaliciousness.
Ah. Blithely she parenthesizes, “(deduped).”
First of all, on what will you dedup? Base URL before any forward slashes? Entire URL? Will you list YPOW’s second result for chocolate cake (http://www.hersheyskitchens.com) as well as the third Google result (http://www.hersheys.com/recipes)? This one’s actually a little interesting because the YPOW result URL turns out to be a redirect to the Google result URL.
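For what it's worth, here is one rough way to frame that choice in Python. Keying on the base host versus the full URL, and chasing redirects to a final destination, are my assumptions about how you might normalize, not a settled design.

```python
from urllib.parse import urlsplit
import urllib.request

def dedup_key(url, by="host"):
    """Two candidate dedup keys: the base host before any forward slashes,
    or the host plus the full path."""
    parts = urlsplit(url.lower())
    host = parts.netloc.removeprefix("www.")    # e.g. hersheyskitchens.com
    if by == "host":
        return host
    return host + parts.path.rstrip("/")        # e.g. hersheys.com/recipes

def resolve_redirects(url):
    """Follow redirects so the hersheyskitchens.com result and the
    hersheys.com/recipes result can be recognized as the same destination."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.geturl()                # final URL after any redirects
    except OSError:
        return url                              # if the check fails, keep the original
```

Chasing redirects costs an extra HTTP round trip per result you check, so it feeds straight back into the processing-overhead question.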
Yes, each search individually returns results pronto. But the results for both now need to be stored, inspected for dups, and reordered before display. By limiting yourself to the first 200 you do make this much more manageable. Still, it would seem to add not only to processing time but also, potentially significantly, to storage requirements (depending on how many concurrent searches are being run at any given moment).
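A rough back-of-envelope with numbers I am making up (about 300 bytes per stored result for URL, title, and snippet, and 50 searches in flight at peak) suggests the working set for in-flight result sets stays modest:

```python
# All three figures are guesses for illustration only.
bytes_per_result = 300        # URL + title + snippet
results_per_search = 200      # the first-200 cutoff from the scheme
concurrent_searches = 50      # wild guess at peak load

working_set = bytes_per_result * results_per_search * concurrent_searches
print(f"{working_set / 1024:.0f} KB held at peak")   # prints "2930 KB held at peak"
```

It is the stored result sets in the next item that would really grow the footprint.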
One item you might consider is storing each result set from the “foreign” search engine for, say, a rolling 24-hour period, so that you don’t keep rerunning the same search. When something hits the headlines I assume you get a number of searches for the same thing over and over, much as portrayed in Google Zeitgeist (http://www.google.com/press/zeitgeist.html). If you store these result sets you again increase your storage requirements, but you reduce processing load and you don’t use up your Google API quota as fast 😉
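Something along these lines would do it, where web_search is whatever wrapper sits around the Google API; the in-memory dictionary and the 24-hour figure are just placeholders for a real cache:

```python
import time

CACHE_TTL = 24 * 60 * 60          # rolling 24-hour window, in seconds
_web_cache = {}                   # query -> (timestamp, results)

def cached_web_search(query, web_search):
    """Reuse a stored wild-Web result set if it is less than a day old,
    saving both processing time and Google API quota."""
    now = time.time()
    entry = _web_cache.get(query)
    if entry and now - entry[0] < CACHE_TTL:
        return entry[1]                        # cache hit: no API call
    results = web_search(query)                # cache miss: spend one API call
    _web_cache[query] = (now, results)
    return results
```

In practice you would also want to cap or sweep the cache, since it grows with every distinct query seen inside the window.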
Another item for thought: Suppose you only grab results from an outside search engine when the POW retrieves zero? Or fewer than X?
For example, “ivory-billed woodpecker” retrieves only two hits. Speaking of which, back to the zeitgeist thing, if YPOW could “notice” that it’s getting a lot of queries on the same term this week, maybe it switches over from searching the Web to searching News?
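Both ideas fit in a few lines. In this sketch the thresholds (MIN_LOCAL_HITS, HOT_TOPIC_THRESHOLD), the weekly reset, and the three search functions are pure guesses at what would feel right, not anything YPOW actually has:

```python
from collections import Counter

MIN_LOCAL_HITS = 3            # "X": below this, go outside for help
HOT_TOPIC_THRESHOLD = 100     # arbitrary cutoff for "a lot of queries this week"
_query_counts = Counter()     # would be reset weekly in a real setup

def search(query, mpow_search, web_search, news_search):
    """Only call an outside engine when the local catalog comes up short;
    if a term is suddenly popular this week, prefer News over Web."""
    _query_counts[query] += 1
    local = mpow_search(query)
    if len(local) >= MIN_LOCAL_HITS:
        return local, []                       # the local catalog is enough
    if _query_counts[query] >= HOT_TOPIC_THRESHOLD:
        return local, news_search(query)       # trending term: the news is the story
    return local, web_search(query)            # otherwise, the general Web
```

A nice side effect is that quiet queries like the woodpecker one only spend API quota when the local catalog has actually come up short.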