Is there an HTML search engine written in Python?
Jun 15th, 2000 00:59
Matthew Schinckel, unknown unknown, Dale Strickland-Clark, Michal Wallace, Robert Roy, JRHoldem
If you're running it on NT, there's a free search engine you can tap
into that comes as part of the NT 4.0 option pack.
Otherwise:
Check out http://ransacker.sourceforge.net/ which has an Index
class that lets you index arbitrary chunks of text. You'll have
to write the program that actually reads the HTML files (and strips
the HTML tags, if that's what you mean by "text content").
It also does ranked searches, but you'll have to wrap that, too, if
you want the output to show up on the web.
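The tag-stripping step mentioned above could look something like the
sketch below. It uses html.parser from the standard library, which is
my own choice and not part of ransacker; the names TextExtractor and
html_to_text are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text content of an HTML document, dropping the tags."""

    # Contents of these elements are code or styling, not page text.
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string as one line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join chunks with a space so words in adjacent elements stay apart,
    # then collapse the whitespace left behind by the markup.
    # (A word split by inline tags, e.g. "qu<b>ick</b>", ends up as two.)
    return " ".join(" ".join(parser.chunks).split())
```

The resulting string is the kind of "arbitrary chunk of text" you could
then hand to whatever indexer you end up using.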
A full-featured full-text indexing solution is not trivial. It all
depends on what kind of queries you want to perform. If all you want
to do are queries such as "find all files which contain the word
'dog'", that can be done quite easily, probably in under 200 lines of
code for a trivial solution using sgmllib and gdbm. However, if you
want to do phrase searching, stem searching, or wild-card searching,
then it gets really complicated in a hurry.
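To give a feel for the trivial "which files contain the word 'dog'"
case, here is a minimal sketch. It assumes the text has already had
its tags stripped, and it swaps in the generic dbm module for gdbm
(sgmllib no longer exists in Python 3). The function names and the
newline-separated value format are my own assumptions, not anything
standard.

```python
import dbm
import re

# A "word" here is just a run of ASCII letters and digits.
WORD_RE = re.compile(r"[a-z0-9]+")


def build_index(db_path, documents):
    """Write a word -> file names index.

    `documents` maps a file name to its already tag-stripped text.
    """
    postings = {}
    for name, text in documents.items():
        for word in set(WORD_RE.findall(text.lower())):
            postings.setdefault(word, set()).add(name)
    # Flag "n" always creates a new, empty database (a static index).
    with dbm.open(db_path, "n") as db:
        for word, names in postings.items():
            # File names are stored newline-separated, so this sketch
            # assumes no file name contains a newline.
            db[word.encode("utf-8")] = "\n".join(sorted(names)).encode("utf-8")


def search(db_path, word):
    """Return the sorted file names that contain exactly this word."""
    with dbm.open(db_path, "r") as db:
        key = word.lower().encode("utf-8")
        if key not in db:
            return []
        return db[key].decode("utf-8").split("\n")
```

Note that a search for "dog" will not find a file that only says
"dogs", which is exactly where stemming comes in and things start to
get complicated.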
Another factor is how many files you are dealing with. Indices often
run 4 to 8 times the size of the indexed files. And do you want to
dynamically update the index, or are you happy just re-indexing the
whole works periodically? A static index is somewhat easier to build
than a fully dynamic one.
An interesting GPL'd indexing package is SWISH++; see:
http://www.best.com/~pjl/software/swish/
A good tactic might be to use this for your indexing and run the
search engine as a daemon, building a Python interface to talk to it
via Unix domain sockets, or alternatively to shell out to it and
capture and parse its output.
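The "shell out and parse the output" variant might look roughly like
this. The command you pass in is whatever search program you have
installed, and the "rank, space, path" line format that parse_hits
expects is purely an assumption for illustration; check what your
search program really prints and adjust accordingly.

```python
import subprocess


def run_search(command, query):
    """Run an external search program and return its non-blank output lines.

    `command` is the program plus any fixed arguments, as a list;
    the query is appended as the final argument.
    """
    result = subprocess.run(
        command + [query],
        capture_output=True,  # grab stdout and stderr
        text=True,            # decode the output to str
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return [line for line in result.stdout.splitlines() if line.strip()]


def parse_hits(lines):
    """Turn '<rank> <path>' lines into (rank, path) tuples.

    Hypothetical format: lines starting with '#' are treated as comments,
    and anything that does not start with an integer rank is ignored.
    """
    hits = []
    for line in lines:
        if line.startswith("#"):
            continue
        rank, _, path = line.partition(" ")
        if rank.isdigit() and path.strip():
            hits.append((int(rank), path.strip()))
    return hits
```

Passing the command as a list rather than a single shell string means
the user's query never goes through the shell, which matters if the
query comes from a web form.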
You also might want to try the Index Server/ASP combo before going to
any third-party solution. Full-text searching is no trivial matter, and
chances are it'll give you all the tinkering options you could want.
Additionally:
There is a really simple search engine (single-word queries only, and
it really only works with small sites) available at:
<http://www.chariot.net.au/~jaq/matt/search.tar.gz>
(or look on Parnassus if it's moved :-)