Charles Engelke's Blog

July 11, 2003

Session: Building a Smarter Search Engine – Artificial Stupidity

Filed under: OSCON 2003 — Charles Engelke @ 1:18 pm

This side room has a few tables, so I can probably take more notes. (I
really wish I’d been able to take more notes for George Dyson’s talk earlier
this morning._ This talk is being given by Maciej Ceglowski, Aaron
Coburn, and Seth Raphael of the National Institute for Technology and Liberal
Education (NITLE), or will be, once they get the projector working. NITLE
is a consortium of liberal arts colleges.

Metadata can be misleading. He showed a map of locations of web logs
that had geographic metadata available for it, and there was a fascinating
pattern: there were a lot of Somali blogs! Who knew? And they were all
writing in German and Czech. There must be a huge expatriate community of
Czechs and Germans in Somalia. Well, the metadata had latitude and
longitude reversed; the blogs were in fact in Europe.

Computers can’t understand human language yet because it’s really hard.
But counting words and statistically analyzing their distributions is easy.
And it turns out that this approach is pretty accurate for grouping and
classifying documents. Documents are vectors in a many-dimensioned space,
where each dimension is a different content word, so you each document is a
point. His approach reduces the number of dimensions, typically by factors
of a hundred or more (10,000 down to 50).

This talk will be online; URL soon. That’s good, because the information
is very dense, and he’s racing through it.

A different approach from the vector space is to use a graph rather than
a vedtor space, showing how common words connect documents. The nodes are
documents, the edges, words. You can then search by taking your query and
finding documents that contain it exactly, then follow edges to nodes that
are close to several of the original documents, and find relevant documents
that contain the exact phrase you’re looking for. This is a lot less
computationally expensive than the vector space approach, and makes it easy
to add or delete nodes from the graph; you’d have to redo a lot of
calculations after any change.

More information is available at

He gives a demonstration based on tutorial chains, where the edges are
weighted by common attendees. If you signed up for a particular tutorial,
this search engine will show you similar ones you might also like.

They gave an interesting demo by searching for “tit-for-tat”. The first
documents returned include that phrase in the context of a game theory
strategy. But you also get a lot of documents about vampire bats that
don’t contain the phrase. But, when you read those documents, you
see that they describe vampire bat behavior that is a case of “tit-for-tat”
in their lives. That’s pretty neat.

This talk will be at soon.


Blog at

%d bloggers like this: