Web Site Crawler

SiteMap Generation

On-Page SEO

IziSEO crawls your entire web site by following the links inside page content. It saves all the pages, internal and external links, and references to images, CSS stylesheets, JavaScript files and even form action URLs.

Here is an extended article on what exactly a site crawler is.

The results of web site crawling can be saved as a project file for later analysis. IziSEO projects can be opened and saved like ordinary files, so you don't lose any of your results when you close the program.

The average time needed to crawl a web site of 300 pages is 3-5 minutes.

You can configure the spider to make random delays between URL requests. This helps reduce the load on the site and keeps you on the safe side if you are crawling someone else's web site.

Below are the settings you can configure for the crawler and its behavior.

SEO Crawler settings

User Agent

Set a custom User Agent for the project. You can imitate a particular browser by setting this field to that browser's User Agent string.
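For illustration only, this is roughly what a custom User Agent amounts to at the HTTP level. The Python sketch below uses the requests library and an example Chrome-like User Agent string; it is not IziSEO's own code, just a demonstration of the header the crawler sends with each request.

import requests

# Example only: a User Agent string imitating a desktop Chrome browser.
USER_AGENT = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
              "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36")

response = requests.get(
    "http://example.com/",
    headers={"User-Agent": USER_AGENT},  # every request carries this header
    timeout=10,
)
print(response.status_code)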

Proxy List

If you work via a proxy, especially if you want to hide your identity and the IP address of your computer, add one or more proxy servers here.
If you add more than one, IziSEO picks one at random from your list each time it is about to make a request to the web site. This is a great way to prevent your IP address from being banned by a web site, since your requests will appear to come from different places.

You can quickly import a prepared list of proxy servers from a text file. Inside the file each proxy server must be on a separate line. Each line must have the following format:

[host]|[port number]|[1 if password needed/0 otherwise]|[username]|[password]

Examples:

37.59.176.38|8089|1|user2|password2
85.88.8.51|6588|0||
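As an illustration of this format, the following Python sketch parses such a file and picks a random proxy for each request, mirroring the behavior described above. The requests library and the file name proxies.txt are assumptions made for the example; this is not IziSEO's internal code.

import random
import requests

def load_proxies(path):
    # Each line has the form: host|port|1 or 0|username|password
    proxies = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            host, port, needs_auth, user, password = line.split("|")
            if needs_auth == "1":
                proxies.append(f"http://{user}:{password}@{host}:{port}")
            else:
                proxies.append(f"http://{host}:{port}")
    return proxies

proxy_urls = load_proxies("proxies.txt")
chosen = random.choice(proxy_urls)  # a different proxy may be picked for every request
response = requests.get("http://example.com/",
                        proxies={"http": chosen, "https": chosen},
                        timeout=10)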

Connection timeout (in sec.)

Choose how long to wait before a request to the web site is cancelled. When a request times out, the URL is skipped and the program moves on to the rest of the pages. Timeouts should be rare unless your Internet provider is having issues or the site is down.
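Conceptually, the timeout behaves like the Python sketch below: if the server does not answer within the chosen number of seconds, the URL is skipped and crawling continues. This is a hedged illustration using the requests library, not the program's actual implementation.

import requests

def fetch_or_skip(url, timeout_seconds=15):
    # Returns the response, or None if the request times out (the URL is then skipped).
    try:
        return requests.get(url, timeout=timeout_seconds)
    except requests.exceptions.Timeout:
        print(f"Timed out after {timeout_seconds}s, skipping: {url}")
        return None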

Max level depth

Choose the maximum level for requested pages. A level is determined by the number of steps it takes to get to a page from the home page.

Max number of pages

Choose the maximum number of pages you want to crawl.
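The two limits above (Max level depth and Max number of pages) can be illustrated together with the simplified breadth-first crawl below. It is a Python sketch that assumes the home page counts as level 0 and uses a crude regular expression for link extraction; a real crawler, IziSEO included, is considerably more involved.

import re
from collections import deque
from urllib.parse import urljoin
import requests

def extract_links(base_url, html):
    # Very rough link extraction, for illustration only (a real crawler uses an HTML parser).
    return [urljoin(base_url, href) for href in re.findall(r'href="([^"]+)"', html)]

def crawl(start_url, max_depth, max_pages):
    seen = {start_url}
    queue = deque([(start_url, 0)])       # the home page is level 0 in this sketch
    crawled = 0
    while queue and crawled < max_pages:  # "Max number of pages" stops the crawl here
        url, level = queue.popleft()
        html = requests.get(url, timeout=10).text
        crawled += 1
        if level >= max_depth:
            continue                      # "Max level depth": don't follow links from deeper pages
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append((link, level + 1))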

 

SEO Crawler behavior

Conform to robots.txt

Turn this on if you want the program to download the robots.txt file and respect its directives. That means certain URLs will not be fetched if robots.txt forbids them.
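As a rough illustration of what respecting robots.txt means, the Python sketch below uses the standard library's robots.txt parser to decide whether a URL may be fetched. It only demonstrates the concept; it is not how IziSEO implements the check internally, and the User Agent string is an example.

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()                               # download and parse the site's robots.txt

url = "http://example.com/private/page.html"
if rp.can_fetch("MyCrawler/1.0", url):  # check the directives that apply to our User Agent
    print("Allowed to crawl:", url)
else:
    print("Forbidden by robots.txt, skipping:", url)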

Follow redirects (HTTP 301 and 302)

Let the program follow the redirect chain until it reaches a URL that returns a non-redirect server response.
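In HTTP terms this means repeatedly requesting the target given in the Location header until a non-3xx status comes back. The Python sketch below shows the idea with the requests library and a cap on chain length to avoid redirect loops; it is an illustration, not IziSEO's code.

import requests
from urllib.parse import urljoin

def resolve_redirects(url, max_hops=10):
    # Follow 301/302 responses until a non-redirect status is returned.
    for _ in range(max_hops):
        response = requests.get(url, allow_redirects=False, timeout=10)
        if response.status_code in (301, 302):
            url = urljoin(url, response.headers["Location"])  # hop to the redirect target
        else:
            return url, response          # final URL and its non-redirect response
    raise RuntimeError("Too many redirects: " + url)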

Crawl subdomains

Tick this if you want to crawl URLs that share the same top level domain. E.g. if this option is on, then http://abc.topdomain.com, http://def.topdomain.com and http://topdomain.com/login/ will all be considered part of the same domain.
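A simple way to express "the same top level domain" in code is to compare the last two labels of the host names, as in the Python sketch below. This is only an illustration; correctly handling multi-part suffixes such as .co.uk requires a public-suffix list, and IziSEO's own logic may differ.

from urllib.parse import urlparse

def same_site(url_a, url_b):
    # True if both hosts end with the same registrable domain, e.g. topdomain.com.
    host_a = (urlparse(url_a).hostname or "").split(".")
    host_b = (urlparse(url_b).hostname or "").split(".")
    return host_a[-2:] == host_b[-2:]

print(same_site("http://abc.topdomain.com", "http://topdomain.com/login/"))  # True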

Crawl "nofollow" links

Turn this on if you want the crawler to check links with the "nofollow" rel attribute.

Check images (crawl SRC)

Tick this option to check for the existence of each src URL in <img> tags. The program will attempt to get the header of each such URL, i.e. just check that it exists.
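Getting the header corresponds to an HTTP HEAD request: only the status line and response headers are returned, not the file itself. The Python sketch below shows the idea (the same header-only check is described for the CSS, JS and form action options further down); it uses the requests library purely as an illustration, not IziSEO's code.

import requests

def resource_exists(url):
    # Header-only check: any status below 400 is treated as "the resource exists".
    try:
        response = requests.head(url, allow_redirects=True, timeout=10)
        return response.status_code < 400
    except requests.exceptions.RequestException:
        return False

print(resource_exists("http://example.com/images/logo.png"))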

Load image content

If you enable the previous option, this one will actually load the whole image into the program instead of just checking if it exists. You will be able to open the images from the image list of each page.

Check CSS links

Tick this option to check for the existence of each href URL in <link rel="stylesheet"> tags. The program will attempt to get the header of each such URL, i.e. just check that it exists.

Check JS links

Tick this option to check for the existence of each src URL in <script> tags. The program will attempt to get the header of each such URL, i.e. just check that it exists.

Check form action links

Tick this option to check for the existence of each action URL in <form> tags. The program will attempt to get the header of each such URL, i.e. just check that it exists.

Don't check external images, CSS, JS and forms

Tick this if you do not want to check the existence of the resources whose URLs belong to a different domain.

Don't crawl URLs containing these patterns

You can exclude certain groups of URLs from being crawled. List all patterns that should be ignored, each on a separate line. The search for these patterns is case-insensitive. Patterns can be either plain text strings or regular expressions. If you wish to use a regular expression, add @@ at the front and at the end of the pattern.
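For example, the matching described above could work like the Python sketch below: plain patterns are matched as case-insensitive substrings, and patterns wrapped in @@ are treated as regular expressions. The exact rules IziSEO applies may differ in detail; this only illustrates the configuration format.

import re

exclude_patterns = [
    "/print/",                # plain text: skip any URL containing "/print/"
    "@@\\?page=\\d+$@@",      # regular expression: skip paginated URLs like ...?page=12
]

def is_excluded(url):
    for pattern in exclude_patterns:
        if pattern.startswith("@@") and pattern.endswith("@@"):
            if re.search(pattern[2:-2], url, re.IGNORECASE):  # regex between the @@ markers
                return True
        elif pattern.lower() in url.lower():                  # plain, case-insensitive substring
            return True
    return False

print(is_excluded("http://example.com/blog?page=12"))  # True
print(is_excluded("http://example.com/about"))         # False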

Make random delays between requests

You can choose to wait for a while before making each subsequent request to the server, so that the crawl is not mistaken for a DoS attack. This matters especially for web sites that have a massive number of pages.
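Conceptually, the option pauses for a random amount of time before each request, roughly as in the Python sketch below. The delay range shown is an arbitrary example, not IziSEO's default, and the requests library is used only for illustration.

import random
import time
import requests

def polite_get(url, min_delay=1.0, max_delay=5.0):
    time.sleep(random.uniform(min_delay, max_delay))  # random pause before every request
    return requests.get(url, timeout=10)

for page in ["http://example.com/", "http://example.com/about", "http://example.com/contact"]:
    polite_get(page)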
