View Full Version : Majestic 12 tweaking guide
[XC] moddolicous
07-01-2007, 06:09 PM
Yep, I finished it. I will fix it up, as I kind of rushed the last part. In order to beat those Norwegians (and in the future, Free-DC) we will have to make our clients work to their potential. Depending on your connection, anywhere between 100k and 500k urls could be gained. Of course, every internet connection and computer is different, so you will have to tweak it to your computer. There are 6 sections we will have to worry about. They are Personal, Connection (http://www.xtremesystems.org/forums/showpost.php?p=2284606&postcount=2), Crawler (http://www.xtremesystems.org/forums/showpost.php?p=2284608&postcount=3), Profiles, Misc. (http://www.xtremesystems.org/forums/showpost.php?p=2284610&postcount=4) and Archiving (http://www.xtremesystems.org/forums/showpost.php?p=2284612&postcount=5). First up is Personal. This is what it looks like:
http://img234.imageshack.us/img234/245/personalvj8.jpg
This is what a typical personal tab should look like.
#1 and #2 should both represent you and not me. If you want to run the client under my name, that is fine. When you first start the client, you will need to enter your password, but after that it will be unneeded.
#3 is just to keep things tidy. Each one should be personalized to the computer it is running on so that it is easy to fix problems.
#4 is really only for people in different areas of the world. One example is Hixie. He is located in Hong Kong. For this reason, it is to his benefit to select domains from his part of the world (.hk, .jp, etc). For most people it won’t make a huge difference (people in US, UK, etc) but for the more secluded members, it could help. That’s about it for the personal tab.
[XC] moddolicous
07-01-2007, 06:09 PM
This is how a typical connection tab looks:
http://img234.imageshack.us/img234/5798/connectionvd2.jpg
#1 and #2 are the figures we put in for our downstream and upstream speeds. They go up in increments of 128, and both max out at 102,400. It is not good to go that high unless you have a fast computer, a fast ISP and a good router. Even if you enter the max, you will not actually be receiving URLs at 102mb a sec; it will only go as fast as your ISP will let it. It usually doesn’t hurt to fib a little about your figures, as these are just basic measurements of your line. Also, the sliders to the right tell Majestic how much of your line to use. I always keep them at 100% and just change the downstream/upstream values.
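To make the increments concrete, here is a rough Python sketch (not part of the MJ12 client; `snap_speed` and its constants are names made up for illustration). It rounds a measured line speed down to a value the fields would accept, assuming they take multiples of 128 up to 102,400 as described above. The unit is whatever the client uses; I am assuming kbit/s here.

```python
STEP = 128           # the fields move in increments of 128 (per the guide)
MAX_VALUE = 102_400  # upper limit of both fields (per the guide)


def snap_speed(measured, step=STEP, maximum=MAX_VALUE):
    """Round a measured speed down to a multiple of `step`.

    The result is clamped to the range [step, maximum]. The unit is
    assumed to match the client's fields (kbit/s is an assumption).
    """
    snapped = (int(measured) // step) * step
    return max(step, min(snapped, maximum))


if __name__ == "__main__":
    # Hypothetical line: 2000 down / 512 up
    print("downstream:", snap_speed(2000))
    print("upstream:", snap_speed(512))
```

So a measured 2000 would be entered as 1920; fib upward from there if you want to.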
#3 is the timeout used when talking to the server, rather than the timeout for urls. It is best kept at ten. That is right from the creator of Majestic. Basically, it is how long the client will wait on the main server before giving up.
#4 no longer applies really. It should be removed, so you can ignore it completely. I actually have no idea what it does, but keeping it low can only speed things up, right?
#5 just keeps a count on how much you uploaded / downloaded. This is useful for people with network caps, but for the lucky uncapped people, you could uncheck this to save some CPU power.
#6 just lets the computer connect straight to the server rather than going through a proxy. If you use a proxy, this will be ignored.
#7 specifies which NIC to keep track of for #5. You can change it around to keep track of how much you downloaded / uploaded on a different NIC.
#8 is proxy settings if you use a proxy. I don’t think you have to check Do Not use proxy for uploading if you aren’t using a proxy, but it can’t hurt. That is it for the connection tab.
[XC] moddolicous
07-01-2007, 06:10 PM
The crawler tab is easily the most important tab. It is the one that will affect your results the most. This is what mine looks like:
http://img406.imageshack.us/img406/815/crawlerkz8.jpg
A lot of settings, right? Don’t worry; I’ll guide you through it.
#1 is the number of workers you have going through your buckets. If you have about 30 domains left in a bucket, the MAX number of crawlers for that bucket will be 30. That is why it is important to look at #2. #2 is the number of URL buckets you can have open at once. If this number is too low, you will never get to your max number of crawlers. If it is too high, your crawlers may be too spread out. I usually keep it at one third of my workers, which seems to work. The one thing to worry about is that every new open bucket will take up ~300mb of space on your hard drive. That is why it is a good idea to defrag once a week with this program.
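Here is the arithmetic behind #1 and #2 as a small Python sketch. It only restates the rules of thumb from this post (one third of the workers, ~300mb per open bucket, at most one crawler per remaining domain); the function names are made up and nothing here talks to the actual client.

```python
BUCKET_DISK_MB = 300  # rough per-bucket disk figure quoted in the guide


def suggest_open_buckets(workers):
    """Rule of thumb from the guide: open buckets = one third of workers."""
    return max(1, workers // 3)


def estimated_disk_mb(open_buckets, per_bucket_mb=BUCKET_DISK_MB):
    """Approximate disk space taken by the open buckets."""
    return open_buckets * per_bucket_mb


def max_active_crawlers(workers, domains_left_per_bucket):
    """Upper bound on busy workers.

    A bucket with N domains left can keep at most N crawlers busy,
    so the total is limited by the domains left across open buckets.
    """
    return min(workers, sum(domains_left_per_bucket))


if __name__ == "__main__":
    workers = 30  # hypothetical worker count
    buckets = suggest_open_buckets(workers)
    print(buckets, "open buckets, about", estimated_disk_mb(buckets), "mb on disk")
```

The last function shows why too few open buckets hurts: with 60 workers and two buckets that have 30 and 5 domains left, only 35 workers can be busy.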
#3 will keep connections alive to the various websites you will be crawling. If your router freezes a lot, this could be an issue. Having this enabled will use more memory, but should increase performance.
#4 is the way that your crawlers search the website. I have had benefits with both. In the past, the alternative library has given me more websites, but at this point it doesn’t make a difference. If the standard .net library doesn’t work for you, then switch it.
#5 is how big the data chunks sent back to the server are. To figure out a good value, go to tools and benchmark uploading.
#6 is the same as #3, so there is no need to explain.
#7 is just a way to speed up uploading for the most part. If you have a really slow upload speed compared to download speed, then you might want to enable this. If you check this option you will effectively DISABLE upstream throttling and your node will upload at the maximum possible speed it can. #8, #9 and #10 are variables that also determine when your client uploads its results. If you have no delays checked, I believe the wait between barrel uploads is basically disabled. To stop a huge backlog of barrels waiting to upload, #9 tells your client when to stop crawling and catch up on uploads. For instance, if my client has 20 barrels backlogged for upload, then it will stop crawling and upload them. Once the backlog is down to 5, my client will start crawling again.
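The stop-at-20, resume-at-5 behaviour is a simple two-threshold switch. Below is a minimal Python sketch of that idea, using the numbers from my example. `BacklogGate` is a made-up name and this is my reading of the behaviour, not the client's actual code.

```python
class BacklogGate:
    """Two-threshold switch for crawling versus catching up on uploads.

    Crawling stops when the upload backlog reaches `stop_at` barrels and
    resumes only once it has drained to `resume_at` or fewer.
    """

    def __init__(self, stop_at=20, resume_at=5):
        if resume_at >= stop_at:
            raise ValueError("resume_at must be lower than stop_at")
        self.stop_at = stop_at
        self.resume_at = resume_at
        self.crawling = True

    def update(self, backlog):
        """Feed the current backlog size; return True if crawling is allowed."""
        if self.crawling and backlog >= self.stop_at:
            self.crawling = False
        elif not self.crawling and backlog <= self.resume_at:
            self.crawling = True
        return self.crawling


if __name__ == "__main__":
    gate = BacklogGate()
    for backlog in (10, 20, 12, 6, 5, 8):
        state = "crawling" if gate.update(backlog) else "uploading only"
        print(backlog, "barrels waiting:", state)
```

The gap between the two numbers is what stops the client from flipping back and forth every time a single barrel finishes uploading.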
#11 is how long the client waits after it starts before it begins crawling. Setting this to zero doesn’t do anything, and it starts a lot better if you have it set to 5.
#12 is the number of backup buckets you keep in case the server handing them out goes down. You can set this number high, although some of them can be fairly large.
#13 is used to prevent MJ12 from taking over your whole hard drive. You can enter a minimum free disk space for the client to watch for. If it gets near there, it will stop crawling.
#14 is just how many buckets your computer should have when it is running low on disk space. I don’t think many of you will have to worry about this.
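If you want to see where you stand against a limit like #13 before picking a value, a quick check along these lines works. This is only a sketch of the idea in Python; the function names and the 2 GB figure are made up, and the client does its own check internally.

```python
import shutil


def free_bytes_on(path="."):
    """Free space, in bytes, on the drive that holds `path`."""
    return shutil.disk_usage(path).free


def should_pause_crawling(free_bytes, min_free_bytes):
    """True once free space has dropped to the configured minimum."""
    return free_bytes <= min_free_bytes


if __name__ == "__main__":
    GB = 1024 ** 3
    min_free = 2 * GB  # hypothetical minimum free space
    free = free_bytes_on(".")
    print("free: %.1f GB" % (free / GB))
    print("pause crawling:", should_pause_crawling(free, min_free))
```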
#15 and #16 both control how much CPU the client itself will use. The robots.txt and URL flush delays help avoid CPU spikes when internal caches get flushed once a minute or less. You can set them to 0 for maximum speed. You probably won't notice a difference, as these settings come from the old days when things were not optimized. All in all they should not really affect crawling, though in theory reducing them might help a bit. The only downside is that during a flush the CPU might spike to 100% for a second or so; depending on how you use your computer, that might not be an issue.
#17 ties in with a cool tool the client has that will keep track of and graph your internet usage, CPU usage and memory usage. In order to get the last two, you will have to enable this.
#18 pertains to errors that your client will get. Say you have a bucket where you are getting a bunch of DNS errors. If you get 10 or more in a row, the client will recheck them to make sure that they really are DNS errors. This is used to boost your success rate. By making the threshold lower, you are lowering the number of errors in a row needed to trigger a recheck. If you turn it off, your success rate might drop, but you might also crawl more urls.
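To picture how the threshold in #18 works, here is a rough Python sketch of an errors-in-a-row counter. It is my guess at the logic based on the description above, with made-up names (`DnsRecheckTracker`, `record`), and not how the client is actually written.

```python
class DnsRecheckTracker:
    """Collect consecutive DNS failures and release them for a recheck.

    Once `threshold` DNS errors arrive in a row, the failed URLs are
    handed back so they can be retried. Any success resets the streak.
    A threshold of 0 or less turns rechecking off.
    """

    def __init__(self, threshold=10):
        self.threshold = threshold
        self.streak = []

    def record(self, url, dns_error):
        """Record one crawl result; return the URLs to recheck (often none)."""
        if self.threshold <= 0:
            return []
        if not dns_error:
            self.streak = []
            return []
        self.streak.append(url)
        if len(self.streak) >= self.threshold:
            to_recheck, self.streak = self.streak, []
            return to_recheck
        return []


if __name__ == "__main__":
    tracker = DnsRecheckTracker(threshold=3)  # low value just for the demo
    for url in ("a.example", "b.example", "c.example"):
        print(url, "->", tracker.record(url, dns_error=True))
```

A lower threshold means rechecks kick in after fewer failures, which costs crawl time but protects the success rate.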
[XC] moddolicous
07-01-2007, 06:10 PM
http://img146.imageshack.us/img146/3650/profileshr8.jpg
I am not going to lie. There is already a really good guide on how to make and use profiles, so I will link to that:
http://www.majestic12.co.uk/forum/viewtopic.php?t=1547
If need be, I will make one.
Misc part is next:
http://img146.imageshack.us/img146/7739/miscxf7.jpg
Let me just clarify that the stock ones work great and setting them too high will make your computer unusable. #1 is the priority MJ12 will run at. If you have it on a dedicated computer, you could set it to high, but on a regular computer, it is better off to stay at normal or lower. It is just the work that the client itself has to do.
#2 and #3 are the priorities that the workers and crawlers work at. These are the ones that crawl the buckets for Urls and such. #4 and #5 are the only ones I change.
#4 is the priority at which the client archives the finished buckets. I get the same speed at normal and below normal, and having it on normal would usually slow down my computer, so I keep it at below normal.
#5 is upload and uses no CPU power (that I notice) and it helps greatly with uploads. I always set this to high.
#6 turns on the client’s web server. If you want to check what those OTHER errors on your client are, or you want to change settings when you are away from your computer (but not away from your network, I think), then enable this. There are a few other things you can do from the web server, such as restart the client.
#7 logs any errors the client encounters. These come in handy when your client is acting weird.
#8 provides warnings if your success rate becomes too low because of the wrong reasons (timeouts too high, DNS errors too high).
#9 just clears your log files when the client starts.
#10 keeps an eye on your number of DNS errors. If they get high (like 10%+) then your client might issue a warning of sorts.
#11 just lets your client minimize to the tray when it starts.
#12 allows your client to restart if errors get too high.
[XC] moddolicous
07-01-2007, 06:11 PM
This is another very important section (almost as much as crawler)
http://img138.imageshack.us/img138/6898/archivinghz7.jpg
#1 should always be checked. If it is not checked, the archiver will run inside the MJ12 program itself, which could lead to hanging.
#2 should be the same as the archiving process in the MISC part.
#3 minimizes disk usage during archiving, but it will use more CPU.
#4 should be used unless you have a slower computer. The settings I used result in the smallest, most compact buckets being sent back. It is also MUCH easier on the server.
#5 should be used if your computer is a P4 or older. The only reason for this is that I believe RAR is a faster archiver, but it is harder for the server to categorize the buckets.
That is my guide. I know that there are a few oddly worded sentences and bad explanations. Over the next couple of days I will fix it up and clean it up. Feedback is appreciated.
hixie
07-02-2007, 06:18 AM
So you mean LZMA is better than WinRAR?
So where do I download the program?
Frisch
07-02-2007, 07:11 AM
So where do I download the program?
HERE (http://www.majestic12.co.uk/projects/dsearch/download.php)
[XC] moddolicous
07-02-2007, 05:38 PM
So you mean LZMA is better than WinRAR?
http://www.majestic12.co.uk/forum/viewtopic.php?t=827
SaFrOuT
07-08-2007, 08:41 AM
Sticky Please
EDIT:
well after reading the guide
the thing i really changed was the number of buckets and active urls it was 20/12 i changed to 30/12
before the prg i used to crawl less than 24,000 urls every hour, while now it has increased to 30,000+ with the success rate being even higher than before :)
i am on ADSL 2mbit/512kbit