Search engine robots fascinate me. So much so, in fact, that I have spent the past seven months building my own search engine just to understand how they work, so that my SEO efforts for my clients improve.
During this process, I’ve learned a lot about what robots face out there on the web. These days, “spider” and “crawler” are generally used interchangeably and are thought to do the same thing (because their duties have often been combined in the modern search engine). But in making my own little search engine, I’ve found that there is a subtle difference between the two.
The truth is, crawlers simply parse a page and index the words on that page into the database. Spiders in their pure form, on the other hand, are true to their name: they traverse the “web” through hyperlinks and gather new links to pages, which the crawlers will then parse and index.
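To make the spider half of that split concrete, here is a minimal sketch in Python (chosen for brevity; it is not the code from my own engine). It only gathers links from a page it has been handed and does no indexing at all. The names `LinkCollector` and `extract_links`, the sample page, and the example.com addresses are all made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkCollector(HTMLParser):
    """Collects the target of every <a href="..."> on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own address.
                absolute = urljoin(self.base_url, value)
                # Drop "#fragment" so the same page is not queued twice.
                absolute, _fragment = urldefrag(absolute)
                is_web_link = absolute.startswith(("http://", "https://"))
                if is_web_link and absolute not in self.links:
                    self.links.append(absolute)


def extract_links(html, base_url):
    """Return the unique web links found in html, in page order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical page, used only to show the behaviour.
SAMPLE_PAGE = """
<html><body>
<a href="/about.html">About</a>
<a href="contact.html#form">Contact</a>
<a href="mailto:me@example.com">Email</a>
<a href="https://example.org/">Elsewhere</a>
</body></html>
"""

for link in extract_links(SAMPLE_PAGE, "https://example.com/index.html"):
    print(link)
```

A real spider would feed each of these links back into a queue of pages to visit and hand the fetched pages to the crawler; this sketch stops at the link list.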
In making my simple little search engine, I first designed and built the database. Then came a simplistic crawler. It could be pointed at a folder on my hard drive and would parse any HTML files found in it. I could then search for a word and it would bring up any pages in that folder that contained the word. Not very sophisticated, but it was a start.
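A folder crawler of that kind can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not the original program: the "database" here is a plain in-memory dictionary mapping each word to the files that contain it, and the names `TextExtractor`, `words_in`, `index_folder`, and `search` are hypothetical.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Gathers page text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def words_in(html):
    """Return the lowercased words found in the text of an HTML string."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()  # flush any text still buffered by the parser
    return re.findall(r"[a-z0-9]+", " ".join(extractor.chunks).lower())


def index_folder(folder):
    """Build a word -> set of file names index for the .html files in folder."""
    index = defaultdict(set)
    for path in sorted(Path(folder).glob("*.html")):
        html = path.read_text(encoding="utf-8", errors="replace")
        for word in words_in(html):
            index[word].add(path.name)
    return index


def search(index, word):
    """Return the sorted file names whose text contains word."""
    return sorted(index.get(word.lower(), set()))
```

Calling `index_folder("some/folder")` and then `search(index, "spider")` gives the single-word lookup described above. It only matches whole words made of ASCII letters and digits, and it knows nothing about ranking or phrases.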
It took me a while to figure out the parsing and indexing process, and after a couple of months I thought, let’s add a spider that could be pointed at a website online and send the crawler to parse its pages. Soon, with a little help from a friend who knew more about C++ programming than I did, I had my spider.
It kinda worked. I was still at a rudimentary stage. I found that I could parse HTML, but the crawler kept “bombing out” when it met up with PDFs and docs. I also trained my crawler and database to recognize word phrases when the documents were retrieved and displayed in the results pages.
Up to this point, I had always pointed my spider at my own websites. But then I decided to point it at a friend’s site. Boy, that got me a nasty email from him. It was at this time I decided I had better train my spider to read robots.txt files so it would stay out of sensitive areas of a site. Now I had a spider that behaved.
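For anyone curious what "reading robots.txt" looks like in practice, here is a small Python sketch using the standard library's `urllib.robotparser`. It is an illustration rather than my spider's actual code: the rules, the `MyLittleSpider` user-agent name, and the example.com addresses are invented, and the rules are parsed from a string so the example runs without touching the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents. A live spider would download this
# from the site root, e.g. with set_url() and read(), before crawling.
SAMPLE_ROBOTS_TXT = """
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""

USER_AGENT = "MyLittleSpider"  # made-up name for this sketch


def build_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    rules = RobotFileParser()
    rules.parse(robots_txt.strip().splitlines())
    return rules


def allowed(rules, url):
    """Return True if USER_AGENT may fetch url under these rules."""
    return rules.can_fetch(USER_AGENT, url)


rules = build_rules(SAMPLE_ROBOTS_TXT)
for url in (
    "https://example.com/index.html",
    "https://example.com/private/notes.html",
):
    verdict = "fetch" if allowed(rules, url) else "skip"
    print(verdict, url)
```

The spider would run a check like `allowed()` on every link before handing it to the crawler, and simply drop the ones that come back as disallowed.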
It was at this point that I realized that the database had to change. I also needed a better means of retrieving documents and ranking them. I spent weeks reading theses and papers on the different search engines, their database arrangements, and their algorithms to see how they did this. In the process, I gained a greater understanding of search engines and a greater respect for them.
I’m now ready to go back to the drawing board and hopefully fix some of the flaws inherent in my search engine. I’m training the crawler to read PDFs and docs. I’ve begun to create a whole new program that emulates the search engines online. I’ll just have to see if I can do it.
Back to the original question: in my case, the crawler came first, then the spider. I would think that this was the case with the original search engines too. What this project has done for me is put me in the frame of mind of the search engines and their engineers. This I can use to aid my clients and myself in gaining higher rankings on the SERPs.
Understanding the search mentality will have its benefits in the future. If nothing else, I hope you now understand that a robot like Googlebot is actually a series of bots, each with its own specific duties.