This is a modified version of the WebLech web crawling spider. I changed the code for the specific needs of a project that I am working on for L3D at the University of Colorado, and I will be adding a few more improvements in a few weeks. The biggest improvement over the original code is the parser, which grabs more URLs and misses far fewer than the original WebLech parser. The code also now has regular expression filtering, so you can have the spider ignore certain parts of the web. The last large improvement is the ability to extract pure textual data from web pages instead of all the HTML, which lets the spider act as a data miner on the web.
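To give a rough idea of what regular expression filtering and text extraction look like, here is a minimal sketch in Java. It is not the code from the download below: the class name `CrawlFilterSketch`, the methods `shouldVisit` and `extractText`, and the ignore patterns are all made up for illustration. The text extraction here is a crude regex tag stripper that does not decode entities such as `&amp;`, so treat it as a demonstration of the idea only.

```java
import java.util.regex.Pattern;

// Illustrative sketch only; names and patterns are hypothetical,
// not taken from the modified WebLech source.
class CrawlFilterSketch {

    // Hypothetical ignore list: skip one host and common binary file types.
    private static final Pattern[] IGNORE = {
        Pattern.compile("^https?://ads\\.example\\.com/.*"),
        Pattern.compile(".*\\.(gif|jpg|png|zip)$", Pattern.CASE_INSENSITIVE)
    };

    // Returns false when the whole URL matches any ignore pattern.
    static boolean shouldVisit(String url) {
        for (Pattern p : IGNORE) {
            if (p.matcher(url).matches()) {
                return false;
            }
        }
        return true;
    }

    // Crude text extraction: drop script/style blocks, strip the
    // remaining tags, then collapse runs of whitespace.
    static String extractText(String html) {
        String s = html.replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ");
        s = s.replaceAll("(?s)<[^>]*>", " ");
        return s.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(shouldVisit("http://example.com/index.html"));
        System.out.println(shouldVisit("http://example.com/logo.GIF"));
        System.out.println(extractText(
            "<html><body><h1>Hi</h1><script>var x=1;</script>"
            + "<p>there  world</p></body></html>"));
    }
}
```

A real spider would run a check like `shouldVisit` on each URL before queueing it, and a proper HTML parser would be a safer choice than regexes for pulling text out of messy pages.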
Original WebLech
My new code
simplecrawl.zip
E-mail any questions or comments to Dan@bandddesigns.com.