Arti Kirch

SIMS 290 E-Publishing
10/5/98 Assignment: Project Research

Note: The group I am part of, Suffragists Speak, is researching technology that we feel our product should have but in which we don't necessarily have a deep background. I am investigating search engines: because our product is educational, it should support inquiry, not just present closed-ended information packets. The articles below are therefore a sample of the issues that seem most interesting or relevant to what we are trying to do.

Deborah Lynne Wiley, "Beyond Information Retrieval: Ways to Provide Content in Context", for Database, August 1998.
http://www.onlineinc.com/database/DB1998/wiley8.html
The article discusses how the Web and its technologies have raised the expectations of information searchers. It is no longer sufficient to answer a query with a list of thousands of items and leave the requestor to identify which ones meet their needs. In other words, the growing population of searchers wants answers, not numbers.
A brief discussion follows on the pre-Web quarter century of information provision, which tended to emphasize raw Boolean searching, large data stores, and high prices, all of which contributed to keeping searching within a professional milieu. However, all of that changed in the 1990s as computing and data storage costs plummeted and the Web made networking the norm.
Possibly the most successful search model to have emerged on the Web is the directory service popularized by Yahoo! Its success has helped "search engines...[recognize] the limits of the massive quantity and lack of quality of information on the Web. Hence, they are preparing a number of strategies for adding 'editorial context' to the data."
The article then enumerates some methods for creating value-added information. The activities that seem to have the most meaning for our product are:
Collaborative Filtering - this method provides "recommendations to a user based on what other users have done". Given that we want to support users who may be unfamiliar with women's suffrage or with history in general, offering pointers to other sources might make using and reusing the product more inviting.
Pattern Recognition - in this advanced feature, "the software uses small pieces of less accurate information that, combined together, give increasing precision. It operates by calculating the probability of seeing x if we see y, and then what is the probability of seeing z if both x and y are present, and so on." This method would seem useful for all types of users, essentially allowing anyone to refine a search as far as they wish while providing them information along the way to spark further inquiry.
Classifying and Clustering - "The important feature is to identify the key concepts within a document, then pull all the information on those topics together, displaying it in a way that the user understands." In our original proposal we wanted to create and open-source metadata for use in any similar site. If we continue with that in our prototype, our own site will become more searchable using these algorithms.
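To make the collaborative filtering idea above more concrete, here is a minimal Python sketch. It is not taken from Wiley's article: the function name `recommend`, the sample viewing histories, and the choice of Jaccard overlap as the similarity measure are all assumptions made for illustration. It suggests pages a user has not yet seen, weighted by how closely other users' histories resemble theirs.

```python
from collections import defaultdict


def recommend(history, user, top_n=3):
    """Suggest unseen items, weighted by similarity to other users.

    `history` maps a user name to the set of items that user has viewed.
    Similarity is Jaccard overlap (an assumed, illustrative choice).
    """
    seen = history[user]
    scores = defaultdict(float)
    for other, items in history.items():
        if other == user:
            continue
        union = seen | items
        if not union:
            continue
        similarity = len(seen & items) / len(union)
        if similarity == 0:
            continue  # users with nothing in common contribute nothing
        for item in items - seen:
            scores[item] += similarity
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


# Hypothetical viewing histories for a suffrage-history site.
history = {
    "ana": {"seneca-falls", "stanton-bio"},
    "ben": {"seneca-falls", "stanton-bio", "anthony-trial"},
    "cy": {"seneca-falls", "19th-amendment"},
    "dee": {"jazz-age"},
}

print(recommend(history, "ana"))  # ['anthony-trial', '19th-amendment']
```

In this sketch "ben" shares more history with "ana" than "cy" does, so the page only "ben" has seen is ranked first, and "dee", who shares nothing, contributes no suggestions.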

Clifford Lynch, "The Internet: Bringing Order from Chaos", Scientific American, March 1997
http://www.sciam.com/0397issue/0397intro.html
This article intrigued me for the technology it suggested to address the author's point, which is that the Web is not yet a digital library. It "was not designed to support the organized publication and retrieval of information, as libraries are...The ephemeral mixes everywhere with works of lasting importance."
This point applies directly to our product: unless our search feature returns useful information, Suffragists Speak may become just another curiosity of little educational value.
The problem could be solved by building or buying our own crawler. However, apart from the issue of needing to maintain the crawler and the resulting index (features not in our business model), "the Web...still lacks standards that would facilitate automated indexing." Further, given that our site is multimedia-intensive and users might want to search out other multimedia, "[a]nother drawback of automated indexing is that most search engines recognize text only... no program can deduce the underlying meaning and cultural significance of an image (for example, that a group of men dining represents the 'The Last Supper')."
Mr. Lynch then suggests the Harvest "gatherer", which, according to its Web site, is also open source. Harvest "lets a Web site compile indexing data for the pages it holds and to ship the information on request to the Web sites for the various search engines." An obvious strength of this engine could be that it will support building a collection on "specific topics for specific uses and tie them loosely together so people can search and locate what they want." I am looking into this UNIX-only application.

Gus Venditto, "Search Engine Showdown", Internet World, May 1996
http://www.internetworld.com/print/monthly/1996/05/showdown.html
I was introduced to searching basics last year in IS202, but I wanted an article that refreshed my memory in order to see if any of the commercial solutions had anything our product should consider. What follows are excerpts of the article's review of seven search engines.
Alta Vista

Quickly returned search results that were "consistently more comprehensive than any of the other sites'. Even obscure references in little-known sites..."

"relevance ranking is clearly not the most effective of the seven engines tested"

"full-text index"; "algorithms seem to rely more on brute force than subtlety because simple queries often generated results for only one term in a phrase"

"does not support stemming,...so that all of its searches are performed only on the exact phrase, and not on the plural or other forms of the word"

"a date filter range"

Excite

"a Web search engine and Web directory organized by category."

"keyword search engine is most prominent"

"search[es] through the messages of about 10,000 newsgroups (roughly one-half the total number of groups), and a classified search feature lets you look through the contents of newsgroups dedicated to public ads. For example, you can search on a particular car model and you're likely to find a list of 'for sale' notices"

"Web index is a full-text database...The engine does not attempt to collect all the pages on the Web, rather it creates an estimate of the most popular sites by looking at links on sites known to be popular"

"does not display URLs in its list of results"

Infoseek

"allows you to type in a query that's as detailed as you wish and it will apply the logic in the background"

"full text of each page in the InfoSeek Guide database is indexed. Searches are case-sensitive (dramatically improving the effectiveness of searches on proper names)"

"proximity ranking...improve[s] the relevancy of its findings."

"search results ... include the title of the Web page, its URL, a relevancy score, the size of the file, and a computer-generated summary."

Provides a "Similar Pages command", i.e. refinement of the search by focusing on the terms in the first site's listing
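
The proximity ranking mentioned above can be illustrated with a short Python sketch. This is not Infoseek's actual algorithm, which the article does not describe; the function names `min_span` and `rank_by_proximity` and the sample pages are hypothetical. The assumed idea is simply that a page ranks higher when all the query terms occur closer together.

```python
def min_span(text, terms):
    """Length in words of the shortest window of `text` that contains
    every query term, or None if any term is missing."""
    words = text.lower().split()
    wanted = {t.lower() for t in terms}
    last_seen = {}  # term -> most recent position
    best = None
    for i, word in enumerate(words):
        if word in wanted:
            last_seen[word] = i
            if len(last_seen) == len(wanted):
                # Shortest window ending at position i.
                span = i - min(last_seen.values()) + 1
                if best is None or span < best:
                    best = span
    return best


def rank_by_proximity(docs, terms):
    """Return names of pages containing all terms, tightest span first."""
    spans = {name: min_span(text, terms) for name, text in docs.items()}
    hits = [(span, name) for name, span in spans.items() if span is not None]
    return [name for span, name in sorted(hits)]


# Hypothetical pages.
docs = {
    "a": "the suffrage movement grew after 1848",
    "b": "suffrage was debated for decades before the movement won",
    "c": "a history of jazz",
}

print(rank_by_proximity(docs, ["suffrage", "movement"]))  # ['a', 'b']
```

Page "a" has the two terms side by side, page "b" has them eight words apart, and page "c" lacks them entirely, so it is left out of the results.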

Lycos

"often delivered the most comprehensive results...However, the size of each report often was overwhelming...generally did not find relevant information on the first two or three pages of Lycos's search results as often as" other engines

"builds its database cumulatively rather than rebuilding the database periodically...[by creating] abstracts of pages based mainly on headers, titles, links, and the first few words of key paragraphs--all of which is designed to maximize broadly relevant information. One result of this design is that the engine doesn't do well on searches for short references buried within documents."

"creates a measure of each site's popularity by looking at the number of other links pointing to a site." Also uses this, in part, for relevancy ranking

includes search statistics, "noting the number of occurrences for each search term" and numbers each citation

found search terms are "marked in bold, which makes it easy to scan through the pages."
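
The link-based popularity measure described above amounts to counting inbound links. The Python sketch below is an assumption-laden illustration, not Lycos's implementation: the function name `inbound_link_counts` and the example site names are hypothetical, and ignoring self-links and duplicate links is a design choice made here, not something the article states.

```python
from collections import Counter


def inbound_link_counts(links):
    """Count how many other sites link to each site.

    `links` maps a site to the list of sites it links to. Self-links and
    repeated links from the same source are ignored (an assumed choice).
    """
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):
            if target != source:
                counts[target] += 1
    return counts


# Hypothetical link graph.
links = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["c.example", "a.example"],
    "d.example": ["c.example"],
}

popularity = inbound_link_counts(links)
print(popularity.most_common(1))  # [('c.example', 3)]
```

A count like this could then feed into relevancy ranking as one signal among several, which is how the review describes its use.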

OpenText

Several search opportunities:

• simple term;
• "power search," which can include up to five search terms and use any of five operators between terms (and, or, but not, near, and followed by);
• specify the field to search for each term (anywhere, title, summary, first heading or URL);
• create a weighted search, in which relevancy can be selected for up to four different terms in a search.

searches are limited to strings
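
A weighted search of the kind listed above might work roughly as in the following Python sketch. The article does not say how OpenText computes its scores, so the scoring rule (weight times number of occurrences), the function names, and the sample pages are all assumptions; only the limit of four weighted terms comes from the review.

```python
MAX_WEIGHTED_TERMS = 4  # the review mentions "up to four different terms"


def weighted_score(text, weights):
    """Sum of weight * occurrence count for each term (assumed rule)."""
    if len(weights) > MAX_WEIGHTED_TERMS:
        raise ValueError("at most four weighted terms are supported")
    words = text.lower().split()
    return sum(w * words.count(term.lower()) for term, w in weights.items())


def weighted_search(docs, weights):
    """Return names of matching pages, highest weighted score first."""
    scored = [(weighted_score(text, weights), name) for name, text in docs.items()]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in ranked if score > 0]


# Hypothetical pages and user-chosen weights.
docs = {
    "page1": "suffrage suffrage vote",
    "page2": "vote vote vote",
    "page3": "jazz history",
}
weights = {"suffrage": 3, "vote": 1}

print(weighted_search(docs, weights))  # ['page1', 'page2']
```

Here "page1" scores 7 and "page2" scores 3, so a page that mentions the heavily weighted term outranks one that repeats a lightly weighted term more often.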

WebCrawler

"can save search results pages as bookmarks, so methodically visiting the sites is fairly easy"

The article ends by offering suggestions for the ideal search engine. This discussion only highlighted for me that we need to think about what features we want to offer, i.e., depth, breadth, or maybe both.