The crawler tab is easily the most important tab. It is the one that will affect your results the most. This is what mine looks like:
A lot of settings, right? Don’t worry; I’ll guide you through it.
#1 is the number of workers going through your buckets. If a bucket has only about 30 domains left in it, the MAX number of crawlers for that bucket will be 30. That is why it is important to look at #2, the number of URL buckets you can have open at once. If this number is too low, you will never reach your max number of crawlers; if it is too high, your crawlers may be spread too thin. I usually keep it at one third of my worker count, which seems to work well. The one thing to watch out for is that every open bucket takes up roughly 300 MB of space on your hard drive. That is why, with this program, it is a good idea to defragment once a week.
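To put some rough numbers on that, here is a quick back-of-the-envelope sketch in Python. The one-third ratio and the ~300 MB figure come from my setup above; the variable names are mine, not the client's:

```python
# Back-of-the-envelope math for the worker/bucket ratio described above.
# These names are hypothetical; the client handles all this internally.

WORKERS = 30                    # setting #1: crawler workers
BUCKETS = max(1, WORKERS // 3)  # setting #2: about one third of workers
MB_PER_BUCKET = 300             # approximate disk cost per open bucket

disk_in_use_mb = BUCKETS * MB_PER_BUCKET
print(f"{BUCKETS} open buckets -> roughly {disk_in_use_mb} MB on disk")
```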
#3 keeps connections alive to the various websites you crawl. If your router freezes a lot, this could be an issue. Having it enabled will use more memory, but it should increase performance.
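If you are not sure what keep-alive buys you, here is a small illustration in Python using the requests library. This is not the client's actual code; it just shows the general idea of reusing one connection for several requests to the same host:

```python
# Illustration of HTTP keep-alive in general, not the client's internals.
# A requests.Session pools connections, so repeated requests to the same
# host reuse the open TCP connection instead of reconnecting each time.
import requests

session = requests.Session()
for path in ("/", "/robots.txt"):
    resp = session.get("https://example.com" + path, timeout=10)
    print(path, resp.status_code)
session.close()
```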
#4 is the library your crawlers use to fetch websites. I have had good results with both. In the past, the alternative library gave me more websites, but at this point it doesn't make a difference. If the standard .NET library doesn't work for you, switch to the alternative.
#5 is the size of the data chunks sent back to the server. To figure out a good value, go to Tools and benchmark your upload speed.
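For a sense of what a chunk-size setting controls, here is a hypothetical sketch: crawl results get split into fixed-size pieces before upload. Every name below is made up for illustration:

```python
# Hypothetical sketch of a chunk-size setting: results are split into
# fixed-size pieces before being sent to the server.

CHUNK_SIZE = 64 * 1024  # assumed value, in bytes; tune via the benchmark

def iter_chunks(data: bytes, size: int = CHUNK_SIZE):
    """Yield successive fixed-size chunks of the payload."""
    for offset in range(0, len(data), size):
        yield data[offset:offset + size]

payload = b"x" * 200_000  # stand-in for a blob of crawl results
print(sum(1 for _ in iter_chunks(payload)), "chunks to upload")
```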
#6 is the same as #3, so there is no need to explain.
#7 is mostly just a way to speed up uploading. If you have a really slow upload speed compared to your download speed, you might want to enable it. If you check this option, you will effectively DISABLE upstream throttling and your node will upload at the maximum speed it can. #8, #9 and #10 are variables that also determine when your client uploads its results. If you have no delays checked, I believe the wait between barrel uploads is effectively disabled. To stop a huge backlog of barrels waiting for upload, #9 tells your client when to pause crawling and upload instead. For instance, if my client has 20 barrels backlogged for upload, it will stop crawling and upload them. Once the backlog drops to 5, my client will start crawling again.
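Here is a minimal sketch of that stop/resume behaviour, using the 20 and 5 thresholds from my setup. The client's internals are not public, so treat this purely as an illustration of the hysteresis:

```python
# Sketch of the backlog hysteresis described for #9/#10: pause crawling
# when uploads pile up, resume once the backlog has drained.

PAUSE_AT = 20   # stop crawling at this many backlogged barrels
RESUME_AT = 5   # start crawling again once the backlog drops to this

def should_crawl(backlog: int, crawling: bool) -> bool:
    """Decide whether to crawl, given the upload backlog and current state."""
    if crawling and backlog >= PAUSE_AT:
        return False        # too many barrels queued: drain uploads first
    if not crawling and backlog <= RESUME_AT:
        return True         # backlog drained: resume crawling
    return crawling         # otherwise keep doing what we were doing

state = True
for backlog in (3, 12, 20, 15, 7, 5, 2):
    state = should_crawl(backlog, state)
    print(f"backlog={backlog:2d} crawling={state}")
```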
#11 is how long the client waits after starting before it begins crawling. Setting this to zero doesn't gain you anything; the client starts up a lot more smoothly with it set to 5.
#12 is the number of backup buckets you keep in case the server handing them out goes down. You can set this number high, although some buckets can be fairly large.
#13 is used to prevent MJ12 from taking over your whole hard drive: you enter a minimum free disk space for the client to watch. If free space gets near that limit, it will stop crawling.
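That kind of guard is easy to picture in code. Here is a sketch using Python's standard library; the 2 GB threshold is just an assumption for the example:

```python
# Sketch of the minimum-free-disk-space guard described for #13.
import shutil

MIN_FREE_MB = 2048  # hypothetical value for setting #13 (2 GB)

def enough_disk(path: str = ".") -> bool:
    """True if free space on the drive is still above the minimum."""
    free_mb = shutil.disk_usage(path).free / (1024 * 1024)
    return free_mb > MIN_FREE_MB

if not enough_disk():
    print("Free space is below the limit: this is where crawling stops.")
```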
#14 is just how many buckets your client should keep when disk space is running low. I don't think many of you will have to worry about this.
#15 and #16 both control how much CPU the client itself will use. The robots.txt and URL flush delays exist to avoid CPU spikes when the internal caches get flushed, which happens once a minute or less. You can set them to 0 for maximum speed; you probably won't notice a difference, as these settings date back to when things were not optimized. All in all they should not really affect crawling, though in theory reducing them might help a bit. The only downside is that during a flush the CPU might spike to 100% for a second or so; depending on how you use your computer, that may not be an issue. The client also has a cool tool that will track and graph your internet usage, CPU usage and memory usage. In order to get the last two, you will have to enable this option.
#18 pertains to errors your client gets. Say you have a bucket where you are getting a bunch of DNS errors. If you get 10 or more in a row, the client will recheck them to make sure they really are DNS errors. This is used to boost your success rate. Lowering the threshold lowers the number of consecutive errors needed before a recheck. If you turn it off, your success rate might drop, but you might also crawl more URLs.
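A hedged sketch of that consecutive-error logic, if it helps make it concrete. The threshold of 10 comes from the description above; the rest is invented for illustration:

```python
# Sketch of the consecutive-DNS-error recheck described for #18.

RECHECK_THRESHOLD = 10  # consecutive DNS errors before rechecking

def urls_to_recheck(results):
    """Queue URLs for recheck once DNS errors hit the threshold in a row."""
    streak, pending, recheck = 0, [], []
    for url, error in results:
        if error == "dns":
            streak += 1
            pending.append(url)
            if streak >= RECHECK_THRESHOLD:
                recheck.extend(pending)  # verify these really failed DNS
                pending.clear()
        else:
            streak, pending = 0, []      # streak broken: errors stand
    return recheck

sample = [(f"http://site{i}.example", "dns") for i in range(12)]
print(len(urls_to_recheck(sample)), "URLs queued for recheck")
```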
Bookmarks