IXEbot
IXEbot is an experimental web crawling bot (sometimes also called a
"spider"), developed at the Dipartimento di Informatica of the
Università di Pisa.
Crawling is the process by which IXEbot discovers new and updated pages,
which are then used to explore techniques of Machine Reading from Web pages.
IXEbot uses an algorithmic process: computer programs determine which
sites to crawl, how often, and how many pages to fetch from each site.
IXEbot's crawl process begins with a list of webpage URLs, generated from
previous crawl processes and augmented with Sitemap data provided by
webmasters. As IXEbot visits each of these websites, it detects links (SRC
and HREF attributes) on each page and adds them to its list of pages to crawl. New sites,
changes to existing sites, and dead links are noted and used to update the
IXE index.
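To make the discovery step concrete, the following Python sketch shows the general idea: fetch a page, collect the values of SRC and HREF attributes, and append any new URLs to the list of pages to crawl. It is only an illustration of the process described above, not IXEbot's actual code, and the names (crawl, LinkExtractor, the max_pages limit) are made up for the example.

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
import urllib.request

class LinkExtractor(HTMLParser):
    """Collects the values of HREF and SRC attributes found in a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)

def crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)   # URLs still to visit
    seen = set(seed_urls)         # URLs already discovered
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        try:
            with urllib.request.urlopen(url, timeout=30) as response:
                html = response.read().decode("utf-8", errors="replace")
        except (OSError, ValueError):
            continue              # dead or unusable link: skip it, note it for the index
        fetched += 1
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            absolute = urljoin(url, link)   # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return seen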
For webmasters: IXEbot and your site
How IXEbot accesses your site
For most sites, IXEbot shouldn't make requests more than once every few
seconds on average.
In order to reduce the costs of creating connections and to reduce network
congestion, IXEbot downloads a few pages (20-30) per connection.
IXEbot exploits HTTP persistent connections using the request header:
Connection: Keep-Alive
This will show up in your logs as multiple GET requests, but in reality only a
single connection is opened with your server, much as a Web browser normally
does to download the several files that make up a single Web page.
IXEbot limits the duration of each connection to 60 seconds, so that it does
not download too much data over a single connection.
Our goal is to crawl as many pages from your site as we can on each visit
without overwhelming your server's bandwidth.
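As a rough illustration of the behaviour just described, the Python sketch below issues several GET requests over a single persistent connection and stops after 60 seconds. The host name and page list are placeholders; this is not IXEbot's actual code.

import http.client
import time

paths = ["/", "/about.html", "/news.html"]   # hypothetical pages on one host
start = time.monotonic()

# One TCP connection to the server; each request below reuses it
# (HTTP keep-alive), yet each appears as a separate GET in the access log.
conn = http.client.HTTPConnection("www.example.com", timeout=30)
try:
    for path in paths:
        if time.monotonic() - start > 60:    # cap the connection at 60 seconds
            break
        conn.request("GET", path,
                     headers={"Connection": "Keep-Alive", "User-Agent": "IXEbot"})
        response = conn.getresponse()
        body = response.read()               # read fully before the next request
finally:
    conn.close()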
Request a change in the crawl rate
You can request that IXEbot access your server less frequently by setting
a value for Crawl-delay in your
robots.txt file.
For example, this will set the delay between two consecutive connections to
your site to a minimum of 10 seconds:
Crawl-delay: 10
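On the crawler's side, this directive can be read with Python's standard robots.txt parser and turned into a pause between requests. The sketch below is only an illustration, and the site URL and helper name (polite_wait) are placeholders:

import time
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()

delay = rp.crawl_delay("IXEbot") or 0   # seconds between connections, if specified
last_request = 0.0

def polite_wait():
    """Sleep until at least `delay` seconds have passed since the last request."""
    global last_request
    elapsed = time.monotonic() - last_request
    if elapsed < delay:
        time.sleep(delay - elapsed)
    last_request = time.monotonic()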
Blocking IXEbot from content on your site
If you want to prevent IXEbot from crawling content on your site, you can
use robots.txt to block access to files and directories on your server.
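For example, a robots.txt like the following (the directory name is only an illustration) blocks IXEbot from one directory while leaving the rest of the site accessible:

User-agent: IXEbot
Disallow: /private/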
Once you've created your robots.txt file, there may be a small delay before
IXEbot discovers your changes. If IXEbot is still crawling content you've
blocked in robots.txt, check that the robots.txt file is in the correct
location. It must be in the top directory of the server (e.g.,
www.myhost.com/robots.txt); placing the file in a subdirectory won't have
any effect.
If you just want to prevent the "file not found" error messages in your web
server log, you can create an empty file named robots.txt.
If you want to prevent IXEbot from following any links on a page of your site,
you can use the nofollow meta tag. To prevent IXEbot from following an
individual link, add the rel="nofollow" attribute to the link itself.
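For example (the URL is a placeholder), the first line is the standard form of the meta tag, which asks crawlers not to follow any link on the page, while the second marks a single link as not to be followed:

<meta name="robots" content="nofollow">
<a href="http://www.example.com/" rel="nofollow">a link not to be followed</a>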
Problems with spammers and other user-agents
The IP addresses used by IXEbot change from time to time. The best way to
identify accesses by IXEbot is to use the user-agent (IXEbot). You can
verify that a bot accessing your server really is IXEbot by using a reverse
DNS lookup.
For example:
> host 131.114.136.66
66.136.114.131.in-addr.arpa domain name pointer attardi-2.itc.unipi.it.
> host attardi-2.itc.unipi.it
attardi-2.itc.unipi.it has address 131.114.136.66
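The same check can be scripted, for instance in Python: reverse-resolve the client address, then resolve the resulting host name forward and confirm it maps back to the same address. The .unipi.it suffix test below is only an assumption inferred from the example host name above, and the function name is made up for the example.

import socket

def is_ixebot(ip_address):
    try:
        hostname = socket.gethostbyaddr(ip_address)[0]       # reverse DNS lookup
        forward_ips = socket.gethostbyname_ex(hostname)[2]   # forward lookup
    except socket.error:
        return False
    # Assumption: genuine IXEbot hosts resolve under unipi.it (see example above).
    return hostname.endswith(".unipi.it") and ip_address in forward_ips

print(is_ixebot("131.114.136.66"))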