Multimedia Laboratory

IXEbot

IXEbot is an experimental web crawling bot (sometimes also called a "spider") developed at the Dipartimento di Informatica of the Università di Pisa. Crawling is the process by which IXEbot discovers new and updated pages, which are then used to explore techniques of Machine Reading from Web pages. IXEbot uses an algorithmic process: computer programs determine which sites to crawl, how often, and how many pages to fetch from each site. IXEbot's crawl begins with a list of webpage URLs, generated from previous crawls and augmented with Sitemap data provided by webmasters. As IXEbot visits each of these websites, it detects links (SRC and HREF attributes) on each page and adds them to its list of pages to crawl. New sites, changes to existing sites, and dead links are noted and used to update the IXE index.
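The link-discovery step described above can be sketched as follows. This is an illustrative example using Python's standard library, not IXEbot's actual code: it extracts HREF and SRC link targets from a fetched page and resolves them against the page's URL so they can be added to the list of pages to crawl.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute URLs from HREF and SRC attributes."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))

page = '<a href="/about.html">About</a><img src="logo.png">'
extractor = LinkExtractor("http://www.example.com/index.html")
extractor.feed(page)
print(extractor.links)
```

Running this yields the two absolute URLs `http://www.example.com/about.html` and `http://www.example.com/logo.png`, which a crawler would then enqueue.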

For webmasters: IXEbot and your site

How IXEbot accesses your site

For most sites, IXEbot shouldn't access your server more than once every few seconds on average. To reduce the cost of opening connections and to limit network congestion, IXEbot downloads a few pages (20-30) per connection. IXEbot exploits HTTP persistent connections using the request header:
Connection: Keep-Alive
This will show up in your logs as multiple GET requests, but in reality only a single connection is opened with your server, much as a Web browser normally does to download the several files that make up a single Web page. IXEbot limits the duration of each connection to 60 seconds, so it does not download too much data over any one connection. Our goal is to crawl as many pages from your site as we can on each visit without overwhelming your server's bandwidth.
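The behaviour described above (several GET requests over one TCP connection) can be demonstrated with Python's standard `http.client`. This is a self-contained sketch, not IXEbot's code: a tiny local HTTP/1.1 server stands in for the crawled site, and three pages are fetched over a single persistent connection.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # HTTP/1.1 keeps the connection open by default

    def do_GET(self):
        body = b"<html>page</html>"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging

server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
pages = []
for path in ("/a.html", "/b.html", "/c.html"):
    conn.request("GET", path)   # reuses the same socket each time
    resp = conn.getresponse()
    pages.append(resp.read())   # drain the body before the next request
conn.close()
server.shutdown()
print(len(pages))  # three pages fetched over one connection
```

In the server's logs this shows up as three GET requests, yet only one connection was opened, which is exactly the pattern a webmaster sees when IXEbot visits.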

Request a change in the crawl rate

You can request that IXEbot access your server less frequently by setting a value for Crawl-delay in your robots.txt file. For example, the following sets the minimum delay between two consecutive connections to your site to 10 seconds:
Crawl-delay: 10
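How a compliant crawler reads this directive can be shown with Python's standard `urllib.robotparser`, which exposes the value via `crawl_delay()`. Here the rules are parsed from a string rather than fetched from a live robots.txt:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Crawl-delay: 10",
])

# Number of seconds the named crawler should wait between connections.
delay = rp.crawl_delay("IXEbot")
print(delay)  # 10
```

A crawler would then sleep at least this long between consecutive connections to the site.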

Blocking IXEbot from content on your site

If you want to prevent IXEbot from crawling content on your site, you can use robots.txt to block access to files and directories on your server. Once you've created your robots.txt file, there may be a small delay before IXEbot discovers your changes.

If IXEbot is still crawling content you've blocked in robots.txt, check that the robots.txt is in the correct location: it must be in the top directory of the server (e.g., www.myhost.com/robots.txt); placing the file in a subdirectory won't have any effect. If you just want to prevent the "file not found" error messages in your web server log, you can create an empty file named robots.txt.

If you want to prevent IXEbot from following any links on a page of your site, you can use the nofollow meta tag. To prevent IXEbot from following an individual link, add the rel="nofollow" attribute to the link itself.
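The effect of such an exclusion can be checked with `urllib.robotparser`, which mirrors the test a compliant crawler performs before fetching a URL. The rules and host below are made-up examples for blocking IXEbot from a /private/ directory:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: IXEbot",
    "Disallow: /private/",
])

# Blocked by the Disallow rule:
print(rp.can_fetch("IXEbot", "http://www.myhost.com/private/data.html"))  # False
# Not covered by any rule, so fetching is allowed:
print(rp.can_fetch("IXEbot", "http://www.myhost.com/public.html"))        # True
```

A crawler calls the equivalent of `can_fetch` for every candidate URL and skips those for which it returns False.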

Problems with spammers and other user-agents

The IP addresses used by IXEbot change from time to time. The best way to identify accesses by IXEbot is its user-agent string (IXEbot). You can verify that a bot accessing your server really is IXEbot by performing a reverse DNS lookup on the IP address and then a forward lookup on the resulting name. For example:
> host 131.114.136.66
66.136.114.131.in-addr.arpa domain name pointer attardi-2.itc.unipi.it.
> host attardi-2.itc.unipi.it
attardi-2.itc.unipi.it has address 131.114.136.66
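The same forward-confirmed reverse DNS check can be automated with Python's standard `socket` module. This is a sketch; the function name and the `.unipi.it` suffix check are illustrative choices, and the result depends on live DNS records:

```python
import socket

def verify_crawler(ip, expected_suffix=".unipi.it"):
    """Forward-confirmed reverse DNS check for a crawler's IP address."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)    # reverse lookup (PTR)
        forward_ip = socket.gethostbyname(hostname)  # forward lookup (A)
    except OSError:
        return False  # no PTR record, or the name does not resolve
    # Genuine only if the name belongs to the expected domain and the
    # forward lookup round-trips back to the original IP.
    return hostname.endswith(expected_suffix) and forward_ip == ip

# verify_crawler("131.114.136.66") returns True as long as the PTR record
# still points into unipi.it and the forward lookup matches.
```

Checking the round trip matters because anyone can set a PTR record for their own IP; only the forward lookup on the claimed hostname confirms the match.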
Dipartimento di Informatica, Pisa, Italy