The crawler tab is easily the most important tab. It is the one that will affect your results the most. This is what mine looks like:
A lot of settings, right? Don’t worry; I’ll guide you through it.
#1 is the number of workers going through your buckets. If a bucket has only about 30 domains left in it, the MAX number of crawlers for that bucket will be 30. That is why it is important to look at #2, the number of URL buckets you can have open at once. If this number is too low, you will never reach your max number of crawlers; if it is too high, your crawlers may be spread too thin. I usually keep it at one third of my worker count, which seems to work well. The one thing to watch out for is that every open bucket takes up roughly 300 MB of space on your hard drive. That is why, with this program, it is a good idea to defragment once a week.
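To put some rough numbers on that, here is a quick back-of-the-envelope sketch in Python. The one-third ratio and the ~300 MB figure come from my setup above; the variable names are mine, not the client's:

```python
# Back-of-the-envelope math for the worker/bucket ratio described above.
# These names are hypothetical; the client handles all this internally.

WORKERS = 30                    # setting #1: crawler workers
BUCKETS = max(1, WORKERS // 3)  # setting #2: about one third of workers
MB_PER_BUCKET = 300             # approximate disk cost per open bucket

disk_in_use_mb = BUCKETS * MB_PER_BUCKET
print(f"{BUCKETS} open buckets -> roughly {disk_in_use_mb} MB on disk")
```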
#3 keeps connections alive to the various websites you crawl. If your router freezes a lot, this could be an issue. Having it enabled will use more memory, but it should increase performance.
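If you are not sure what keep-alive buys you, here is a small illustration in Python using the requests library. This is not the client's actual code; it just shows the general idea of reusing one connection for several requests to the same host:

```python
# Illustration of HTTP keep-alive in general, not the client's internals.
# A requests.Session pools connections, so repeated requests to the same
# host reuse the open TCP connection instead of reconnecting each time.
import requests

session = requests.Session()
for path in ("/", "/robots.txt"):
    resp = session.get("https://example.com" + path, timeout=10)
    print(path, resp.status_code)
session.close()
```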
#4 is the library your crawlers use to fetch websites. I have had good results with both. In the past, the alternative library gave me more websites, but at this point it doesn't make a difference. If the standard .NET library doesn't work for you, switch to the alternative.
#5 is the size of the data chunks sent back to the server. To figure out a good value, go to Tools and benchmark your upload speed.
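For a sense of what a chunk-size setting controls, here is a hypothetical sketch: crawl results get split into fixed-size pieces before upload. Every name below is made up for illustration:

```python
# Hypothetical sketch of a chunk-size setting: results are split into
# fixed-size pieces before being sent to the server.

CHUNK_SIZE = 64 * 1024  # assumed value, in bytes; tune via the benchmark

def iter_chunks(data: bytes, size: int = CHUNK_SIZE):
    """Yield successive fixed-size chunks of the payload."""
    for offset in range(0, len(data), size):
        yield data[offset:offset + size]

payload = b"x" * 200_000  # stand-in for a blob of crawl results
print(sum(1 for _ in iter_chunks(payload)), "chunks to upload")
```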
#6 is the same as #3, so there is no need to explain.
#7 is mostly just a way to speed up uploading. If you have a really slow upload speed compared to your download speed, you might want to enable it. If you check this option, you will effectively DISABLE upstream throttling and your node will upload at the maximum speed it can. #8, #9 and #10 are variables that also determine when your client uploads its results. If you have no delays checked, I believe the wait between barrel uploads is effectively disabled. To stop a huge backlog of barrels waiting for upload, #9 tells your client when to pause crawling and upload instead. For instance, if my client has 20 barrels backlogged for upload, it will stop crawling and upload them. Once the backlog drops to 5, my client will start crawling again.
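Here is a minimal sketch of that stop/resume behaviour, using the 20 and 5 thresholds from my setup. The client's internals are not public, so treat this purely as an illustration of the hysteresis:

```python
# Sketch of the backlog hysteresis described for #9/#10: pause crawling
# when uploads pile up, resume once the backlog has drained.

PAUSE_AT = 20   # stop crawling at this many backlogged barrels
RESUME_AT = 5   # start crawling again once the backlog drops to this

def should_crawl(backlog: int, crawling: bool) -> bool:
    """Decide whether to crawl, given the upload backlog and current state."""
    if crawling and backlog >= PAUSE_AT:
        return False        # too many barrels queued: drain uploads first
    if not crawling and backlog <= RESUME_AT:
        return True         # backlog drained: resume crawling
    return crawling         # otherwise keep doing what we were doing

state = True
for backlog in (3, 12, 20, 15, 7, 5, 2):
    state = should_crawl(backlog, state)
    print(f"backlog={backlog:2d} crawling={state}")
```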
#11 is how long the client waits after starting before it begins crawling. Setting this to zero doesn't gain you anything; the client starts up a lot more smoothly with it set to 5.
#12 is the number of backup buckets you keep in case the server handing them out goes down. You can set this number high, although some buckets can be fairly large.
#13 is used to prevent MJ12 from taking over your whole hard drive: you enter a minimum free disk space for the client to watch. If free space gets near that limit, it will stop crawling.
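That kind of guard is easy to picture in code. Here is a sketch using Python's standard library; the 2 GB threshold is just an assumption for the example:

```python
# Sketch of the minimum-free-disk-space guard described for #13.
import shutil

MIN_FREE_MB = 2048  # hypothetical value for setting #13 (2 GB)

def enough_disk(path: str = ".") -> bool:
    """True if free space on the drive is still above the minimum."""
    free_mb = shutil.disk_usage(path).free / (1024 * 1024)
    return free_mb > MIN_FREE_MB

if not enough_disk():
    print("Free space is below the limit: this is where crawling stops.")
```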
#14 is just how many buckets your client should keep when disk space is running low. I don't think many of you will have to worry about this.
#15 and #16 both control how much CPU the client itself will use. The robots.txt and URL flush delays exist to avoid CPU spikes when the internal caches get flushed, which happens once a minute or less. You can set them to 0 for maximum speed; you probably won't notice a difference, as these settings date back to when things were not optimized. All in all they should not really affect crawling, though in theory reducing them might help a bit. The only downside is that during a flush the CPU might spike to 100% for a second or so; depending on how you use your computer, that may not be an issue. The client also has a cool tool that will track and graph your internet usage, CPU usage and memory usage. In order to get the last two, you will have to enable this option.
#18 pertains to errors your client gets. Say you have a bucket where you are getting a bunch of DNS errors. If you get 10 or more in a row, the client will recheck them to make sure they really are DNS errors. This is used to boost your success rate. Lowering the threshold lowers the number of consecutive errors needed before a recheck. If you turn it off, your success rate might drop, but you might also crawl more URLs.
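A hedged sketch of that consecutive-error logic, if it helps make it concrete. The threshold of 10 comes from the description above; the rest is invented for illustration:

```python
# Sketch of the consecutive-DNS-error recheck described for #18.

RECHECK_THRESHOLD = 10  # consecutive DNS errors before rechecking

def urls_to_recheck(results):
    """Queue URLs for recheck once DNS errors hit the threshold in a row."""
    streak, pending, recheck = 0, [], []
    for url, error in results:
        if error == "dns":
            streak += 1
            pending.append(url)
            if streak >= RECHECK_THRESHOLD:
                recheck.extend(pending)  # verify these really failed DNS
                pending.clear()
        else:
            streak, pending = 0, []      # streak broken: errors stand
    return recheck

sample = [(f"http://site{i}.example", "dns") for i in range(12)]
print(len(urls_to_recheck(sample)), "URLs queued for recheck")
```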
Bookmarks