Link Data Research – Majestic SEO (Fresh Index), Moz Linkscape and Ahrefs compared

Initial Research Study of backlink data for 100 sites

Back in 2008, in the very early days of Analytics SEO we built our own web crawler and set about trying to crawl the web.

As we crawled more and more weird and wonderful websites, our data grew almost exponentially and we soon realised that in no time at all we would have more servers than chairs in our office!

 

Given the ever-increasing nature of the task (trillions of URLs and counting), the size of the data to be crawled, processed and analysed was at a scale that meant that crawling the web and licensing the database was going to become even more of a specialist function.

In 2010, we signed a licensing deal with Majestic SEO to incorporate their link data into Analytics SEO for all our customers. We still crawl the web today, and our crawler is much more advanced than it was back then (we’ve seen lots more peculiar sites to help us refine it), but today we focus on spidering client sites only and re-crawling backlink data.

In 2013, there are now a few more specialist link data providers of which Majestic SEO, Ahrefs and Moz are the major providers. There have been a number of blog posts and articles comparing each provider for coverage, depth and quality; some of them have provoked a lot of debate. However, we felt that most of these comparisons only looked at data for a handful of sites and in order to get a clearer picture of the relative strengths of each provider, a comparison of data across a much larger sample size was necessary. So that’s what we have set out to do.

Today, we’re publishing a summary of our initial findings across 100 sites. But we are already in the process of collecting data for 1,000 randomly selected sites and will follow up with an even more in-depth study in a matter of weeks.

The purpose of publishing the initial study now is to ensure we have thoroughly considered all angles to make sure that our extensive comparison will be as fair and objective as possible. We know how hard it is to successfully crawl the web (after all we gave up doing it) and we completely respect what all three of these companies have achieved. Let’s not forget that without these guys, reporting on our link efforts would be like going into a gun fight with a water pistol!

Research Methodology

We wanted to undertake more thorough research across more sites than had been attempted before. But there are obvious limits in terms of what’s practical and feasible to achieve in a reasonable period of time – after all some sites have millions of links. We decided to exclude these large sites for many practical reasons; one of which was that we wanted to re-crawl all these links and we felt that we might make ourselves unpopular if we made too many simultaneous requests from each site.

In the end, we decided to compare sites that were reported as having fewer than 50,000 backlinks by all data providers; this nicely matched up with the data providers’ API constraints.

What we are doing here is an ‘end-user’ analysis of the total number of links available for selected sites from the three backlink providers and not an analysis of how well their crawlers and indexation works (this has been done before, many times over).

At this stage we’d like to thank the data providers for providing free access to their APIs for the purpose of this research.

We selected 1,000 sites at random from our database. We chose this sample as it enabled us to sense-check the data that came back; we also have anonymized Google Analytics data for these sites, so we could start to look at how well each provider actually picked up links that are referring traffic to a given site. From this base we have selected 100 sites at random for this initial study.

We used each provider’s API to get up to 50,000 backlinks for each of the 100 sites.

We re-crawled all the backlinks using our own crawler, followed any redirects in place, and then checked every source URL to see whether it was indexed in Google (this amounted to ~6m checks for just 100 sites!).
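As a rough illustration of the "is the link still there?" part of a re-crawl, here is a minimal Python sketch. It is not the crawler used in this study; the names `LinkCollector` and `link_is_live` are hypothetical, and it assumes the page HTML has already been fetched (with redirects followed) by some other means.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the absolute href of every <a> tag on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the URL of the source page.
                self.hrefs.append(urljoin(self.base_url, href))


def link_is_live(page_html, page_url, target_url):
    """Return True if page_html contains an anchor pointing exactly at target_url."""
    collector = LinkCollector(page_url)
    collector.feed(page_html)
    return target_url in collector.hrefs
```

This is an exact string match on the resolved URL, so any difference in how a provider parses or encodes a link will show up as "not live".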

The first piece of this research is based on source URLs only (a source URL can have multiple outbound links), so other link factors such as hostnames, unique domains, IPs, Unique Class Cs or referrals, whilst accounted for in the research, are ignored at this stage. We are interested in the absolute total numbers of source URLs found by all 3 providers. We then get into the total number of links thereafter.

Please remember this is just the initial study; further in-depth research and analysis is already underway on the full data set – so if you have any suggestions or critique then do please comment below.

OK, let’s get stuck into some data!

For the purposes of a quick preview into the full set of data (1,000 sites) we analysed the backlinks of 100 sites (maximum of 50K links per site).  Please note, when we talk about Majestic SEO we were only analysing its Fresh Index.

Total number of sites found by data provider (Base n=100).

It seems that all providers have data for the given sites. Could this mean they have 100% coverage? Evidently, this sample size is too small to imply this.  But it will be interesting to see how this coverage changes across 1,000 sites (which is admittedly still not a huge sample size given the size of the web).

If each provider has 100% coverage then will this ratio also apply to the total number of links they can serve per site? i.e. 100% of all links available to any given site or page?  To find out we then compared the total number of Root Domains and Source URLs (note – not links yet) for each data provider.

Interestingly, Majestic SEO has the highest total number of detected source URLs. Its 1.4 million is 100,520 (about 8%) more than Ahrefs, and that’s a significant number as it translates to about 1,005 per site. So where does the difference come from?  And what about Moz?  Moz have found the same sites but clearly have a different approach to crawling the sites they find (quality over quantity?).
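The per-site and percentage figures can be checked with a few lines of Python. Only the 100,520 difference and the 100-site base are exact figures from the text; the Majestic SEO total is the rounded 1.4 million, so the percentage is approximate.

```python
# Figures quoted in the text; the Majestic SEO total is rounded to 1.4 million.
majestic_source_urls = 1_400_000   # approximate, as quoted
lead_over_ahrefs = 100_520         # difference quoted in the text
sites = 100

# Ahrefs total implied by the two figures above (an approximation).
ahrefs_source_urls = majestic_source_urls - lead_over_ahrefs
extra_per_site = lead_over_ahrefs / sites
lead_pct = 100 * lead_over_ahrefs / ahrefs_source_urls

print(f"Extra source URLs per site: {extra_per_site:.0f}")
print(f"Lead over Ahrefs: {lead_pct:.1f}%")
```

With these inputs the lead works out at roughly 1,005 source URLs per site and a little under 8%.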

In terms of root domain numbers it’s a similar story, with Majestic SEO having the largest number of domains (76,033) and again, Ahrefs with the second highest number (65,798). Moz has fared much better this time round and has a total of 52,174 domains.

Let’s throw something else in the mix: Number of links!

One metric we are all concerned about as SEOs is that golden number of links. Moving on from the number of source URLs we can start comparing some link data.

A pattern is beginning to emerge. Majestic SEO has returned a whopping 2 million links in total (19% more than Ahrefs). What would this number be if we looked at 1,000 sites? There is a difference of 351,578 links between Majestic SEO and Ahrefs and that’s an amount that could hurt your link reports if they were to be missing. Whilst those two battle for the top spot, Moz comes in at a modest 247,876 links. That’s a very small number compared to the others and you might not unreasonably jump to the conclusion that this quickly rules out Moz as an effective source for link data. But let’s not be so hasty.

When you look at the ratio of links found per source URL then Moz does just fine.

 

It is evident that, of the URLs it chooses to crawl, Moz finds as many links as Ahrefs. Majestic SEO finds slightly more links per source URL, but this could be explained by the differences between how the providers handle link parsing (more on this later).

So let’s see what happens when you re-crawl these links to make sure they still exist.

Interesting. Around 40% of the links reported by both Majestic SEO and Ahrefs no longer exist. Most alarming is the number from Moz, where only 46% of their reported links actually exist and in total numbers that’s only 114K links from a possible 2.3 million live links. That’s a significant gap in numbers.  Please remember, this is a random sample of 100 sites and it’s worth waiting for the full 1,000 site study to determine if this is a consistent pattern.

To double check this analysis, we randomly selected 100 backlinks from each data set that were marked as ‘not live’ by our software and then manually checked the URLs to see whether we had missed something in our automated checks or whether there might be other anomalies caused by malformed HTML for example or differences in how links are parsed from the HTML.

To be honest, when we first ran this analysis the numbers looked too low and we found a few instances where we were not finding an exact matching link on the page.  For example, Moz were stripping out hyphens in anchor text (unless the anchor text was a URL) and the other providers and ourselves were not.  This meant that we were not finding an exact matching string and therefore declaring the link as ‘not live’.
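To illustrate the kind of mismatch described above, here is a small sketch that compares anchor text after normalising it. The function names are hypothetical, and the rule (ignore hyphens, whitespace and case) is an assumption chosen for illustration, not a description of how any provider or our own software actually parses anchors.

```python
import re


def normalise_anchor(text):
    """Drop hyphens and whitespace and lower-case, so the comparison ignores them."""
    return re.sub(r"[-\s]+", "", text).lower()


def anchors_match(reported, found):
    """Compare a provider's reported anchor text with the text found on the page."""
    return normalise_anchor(reported) == normalise_anchor(found)
```

Under this rule a reported anchor of "link building tips" or "linkbuilding tips" would both match "link-building tips" on the page, whereas an exact string comparison would mark the link as ‘not live’.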

Whilst we have gone through quite an extensive process to try and make this analysis as accurate as possible, we recognise that there could still be instances where we do not find a matching link because we parse a URL differently to a particular provider.  For this reason, we’ll go back to each provider to see whether we can clarify the different ways they parse the data.

We have also re-run the analysis to show URLs which have a live link to the same hostname.  This shows you that even though we have been unable to find an exact matching URL we could find a link on that page to the same hostname.  This indicates that, a) the link has changed; or b) our matching rules did not find an exact match and it is actually the same link; or c) there was actually more than 1 link on the page and the link we are looking for has been deleted.
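The two-level check described here (exact URL first, then same hostname) could be sketched as follows. `classify_link` and its return labels are hypothetical names, and the input is assumed to be the list of absolute link URLs already extracted from the source page.

```python
from urllib.parse import urlsplit


def classify_link(reported_target, hrefs_on_page):
    """Return 'exact', 'same_hostname' or 'not_found' for a reported link."""
    if reported_target in hrefs_on_page:
        return "exact"
    wanted_host = urlsplit(reported_target).hostname
    if wanted_host is not None:
        for href in hrefs_on_page:
            # .hostname is lower-cased by urlsplit, so the comparison ignores case.
            if urlsplit(href).hostname == wanted_host:
                return "same_hostname"
    return "not_found"
```

A "same_hostname" result does not say which of the three explanations above applies; it only separates pages that still link to the site from pages that no longer do.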

Links change.  They get amended, archived, paginated onto different URLs and removed.  That’s why, even though all 3 providers do an amazing job of trying to maintain the freshness and quality of the link data, we re-crawl it.

Now that we know how many live links we have, why don’t we check how many are indexed by Google! Why? Because that’s our true indicator as to whether Google accepts these source URLs as valid web pages whose links may well be worth their juice. As a general rule of thumb, if Google has crawled a page and providing the page isn’t blocked by robots.txt or meta tags it should include it in its index. If not, then Google has deemed the page to be spammy, low quality content, duplicate content or from an untrusted domain. The list goes on… but you get the picture. If it’s not in the index, it’s much less likely that the links are worth anything.

I know what you are all thinking. Look at Moz! As a ratio, Moz detects better quality links than Majestic SEO and Ahrefs. Or does it? Does being indexed in Google mean the link from that page is then acceptable? Probably. Does it make it a good link? No. Not really. It just depends on what we call a good link.

However, given the index rate from each provider I think it’s safe to assume that the more links you crawl, and the deeper you dig, the more spam you find as a result. So having a lower index rate might not necessarily be a bad thing, especially as these are all the links that have been found pointing towards a given site; and not just links that have met Google’s quality guidelines and point towards a given site. As SEOs, this is exactly the sort of thing we need to know when analysing links (at any level), especially if you want to find (and disavow) bad links.

In this small data sample of 100 sites, Ahrefs and Majestic SEO have been very, very close in numbers. Why? Are they picking up the same links?

What happens when you look at Source URLs and filter these by uniqueness? In other words, how many unique URLs has each provider found that the others haven’t? This should be interesting…

My initial reaction was to say, “Wow! I need to be using all 3 data providers!” … That’s until I started adding up the numbers. Instinctively speaking, it just doesn’t feel right that 66% of Majestic SEO’s URLs are unique to it, 64% of Ahrefs’ URLs are unique to it and 36% of Moz’s URLs are unique to it.  I mean… it just doesn’t seem realistic, the number of common URLs should be much higher.
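For readers who want to run the same kind of uniqueness check on their own exports, the calculation is a simple set difference. The sketch below uses a hypothetical `unique_share` function and made-up URLs; it is not the study's data or code.

```python
def unique_share(provider_urls):
    """Map each provider to the fraction of its URLs that no other provider has."""
    shares = {}
    for name, urls in provider_urls.items():
        # Union of every other provider's URL set.
        others = set().union(*(u for n, u in provider_urls.items() if n != name))
        unique = urls - others
        shares[name] = len(unique) / len(urls) if urls else 0.0
    return shares


# Toy data, purely illustrative.
toy = {
    "A": {"u1", "u2", "u3", "u4"},
    "B": {"u3", "u4", "u5"},
    "C": {"u4", "u6"},
}
print(unique_share(toy))
```

Because this compares raw strings, two providers reporting the same page with different encoding, case or canonical form would each be counted as having a "unique" URL, which is one way the unique percentages could be inflated.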

This is a key focus area in the forthcoming in-depth research study.  So if you are interested in seeing the detailed analysis, register for our newsletter so you can get notified as soon as it is released.

In Summary

In the next stage of the research, we will expand on the number of sites we are looking at to the full 1,000 sites. This should allow us to firm up our initial thoughts and draw better conclusions on the data presented so far.

Secondly, we will be verifying these raw link numbers at a much more granular level and trying to account for all the different reasons there might be duplicates and other anomalies, both at a URL level and link level. We’ve already seen how different providers handle parsing HTML slightly differently and we’ll be digging further into issues such as:

  • URL encoding
  • Case (in)sensitivity
  • Canonicalisation issues
  • The definition and composition of a link, e.g. Is a picture and text link in the same anchor tag one link or two?  It appears the providers have different views on this!
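To give a feel for the first three items in the list, here is a minimal sketch of building a comparison key for a URL. `normalise_url` is a hypothetical helper and the rules in it are assumptions for illustration: lower-case the scheme and hostname, drop default ports and fragments, and unify percent-encoding in the path. The result is meant for comparing URLs, not for fetching them (for instance, decoding an encoded slash changes the path).

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def normalise_url(url):
    """Build a comparison key so trivially different forms of a URL compare equal."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep a port only when it is not the default for the scheme.
    port = parts.port
    if port and (scheme, port) not in {("http", 80), ("https", 443)}:
        host = f"{host}:{port}"
    # Decode then re-encode so that %7E and ~ end up identical.
    path = quote(unquote(parts.path), safe="/~") or "/"
    # The fragment is dropped; the query string is kept untouched.
    return urlunsplit((scheme, host, path, parts.query, ""))
```

Path case is deliberately left alone here, since whether /Page and /page are the same resource depends on the server.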

I believe this will all have a material effect on this analysis. Sorry, but you’ll just have to wait a few more weeks until it’s complete. So in the meantime, before it’s too late, please do comment below with any other suggestions you have or analysis you would like to see.

Just to re-iterate – crawling the web is hard and expensive and the task can only get bigger!

It is extremely difficult for any single provider to excel at all aspects of coverage, depth, freshness and accuracy, e.g. if you build as comprehensive an index of as many unique domains as possible then it is likely that the depth of your crawl or frequency of crawling might be compromised.

Quality Metrics

One area this research doesn’t touch on is the quality and importance of each provider’s authority metrics. For example, it appears that Moz are more concerned with link quality than total number of links. From speaking to the team at Moz, it appears they are using their metrics to help them with that evaluation and to assist in the filtering of spammy links.

In a future research study we might tackle this subject by correlating each provider’s data against PageRank/rankings/total number of organic keywords (or something else?). We’d welcome some suggestions.

In the meantime, all that remains to be said is a big thank-you to Moz, Ahrefs and Majestic SEO for providing us with the API usage for this research; and hats off to them all for rising to the considerable challenges of crawling the web to provide an independent authoritative source of link data. Until Google gives us access to all our links via an enhanced Google Webmaster Tools API then look no further for your link data!

Majestic SEO API

Moz API
Ahrefs API

Stay tuned for the next piece in a few weeks’ time and do add your comments and insights below…

Update: Full review post

www.analyticsseo.com/blog/moz-vs-majesticseo-vs-ahrefs-part2


17 Responses to Link Data Research – Majestic SEO (Fresh Index), Moz Linkscape and Ahrefs compared

  1. Anonymous says:

    This bit:

    “My initial reaction was to say, “Wow! I need to be using all 3 data providers!” … That’s until I started adding up the numbers. Instinctively speaking, it just doesn’t feel right that 66% of Majestic SEO’s URLs are unique to it, 64% of Ahrefs’ URLs are unique to it and 36% of Moz’s URLs are unique to it. I mean… it just doesn’t seem realistic, the number of common URLs should be much higher.”

    Really weird to think that each service is missing so many of the links coming in.

    Also be really interested to see if there’s any truth behind the ‘Are Moz picking up links that are a better quality’ implication that surfaces a couple of times in this report.

    Great report that makes very interesting reading – but I get the feeling you have just finished the preface, and are only just starting on Chapter One :)

    • Alex123 says:

      Yes – this is indeed chapter one. There’s a lot more in-depth analysis to come and these stats will look somewhat different once we start to dig in a little more.

      • Laurence says:

        I think Moz have focused from the outset on using their link metrics to help them make good decisions about which links to crawl; that’s not to say that Majestic SEO and Ahrefs don’t have many of these links… in many instances they do. It will be interesting to hear from Moz as to whether they plan to expand the size of their index in the future with a view to helping their users find spammy links pointing to their sites.

        [By the way if you want to comment on the blog and have your name appear - simply create a free account and login first. We're about to upgrade the blog to support social sign-in but it wasn't ready in time ;-(.....of course the benefit of this is you get a free account with your comment ;-) ]

  2. Anonymous says:

    Great post BTW, what did you use to verify if the links were live? Also what did you use to verify they were in Google?

    Looking forward to your follow-up post!

  3. nicholashgarner says:

    This is very interesting stuff. Obviously Moz doesn’t come out well here…. :-(

    Also the unique URLs thing is interesting. I suppose the question is how deep these tools actually go, i.e. is the real link depth something like 4 times as deep as these tools are going? All will be revealed I guess…

    I suppose the 3rd thing that comes to my mind is which of the tools has the most up-to-date and accurate information on a site’s links. With timely data, we can depend on it showing, for example, the link profile at the time of a penalty or a big lift in rankings, or the exact current link profile. Of course this is what Majestic’s Fresh Index attempts to do, so it would be interesting to compare freshness.

    Another thing that comes to mind is the effectiveness of each provider’s metrics at showing something meaningful. So for instance we took 400 .co.uk domains, ran them through SEMrush’s UK database and then correlated the traffic numbers of these domains with Majestic Trust Flow and Citation Flow figures. Trust Flow had a 75% correlation with traffic, i.e. high TF = Google ranks that site. The post is here: http://90digital.com/blog/seo-comment/study-trust-flow-as-predictor-of-organic-traffic-4028.html

    anyhow this is fascinating!

    • Laurence says:

      Thanks for the heads up Nick – we’ll check out your post. We’re planning to do some work on analysing freshness so watch this space.

  4. Anonymous says:

    This is an incredible amount of data to sift through. Very interesting study comparison. As mentioned, it would be a lot different if all possible sites were diagnosed by this study. But is that not what the search engines do anyway? So, with that being said, “1,000 sites” is a lot. Curious about the depth of links as well. Were all pages on these chosen sites crawled? If not, then they should be, because link depth is a major game player when link building. Fantastic research.

  5. Anonymous says:

    Wow. This is so cool! I’m looking forward to part 2 of your analysis.

    A couple of things to bear in mind. Different tools are good for different things; and it’s important to make sure you’re always comparing apples to apples.

    For example, I have a tool preference for researching new link opportunities. And if I’m wanting to measure link growth over time, I compare ahref to ahref, Majestic to Majestic, and Moz to Moz.

  6. Anonymous says:

    Great article. Have you run any of this analysis based on linking root domains instead of links? I did a similar analysis recently on a much smaller data set and found that due to canonicalization and other weird URL inconsistencies, linking root domains was a much more accurate way to compare across the providers. When I did this, I was getting much higher overlap rates than what you’re reporting. It’s relatively easy to parse out the domains – would love to see this in the next version!

  7. I am very thankful for the very informative blogs and I extremely grateful that you perform this piece of writing very simply, I mean to say that it’s quite simple to read and understand.

  8. Only 2,650 words? Must try harder! :)

    On a more serious note, fantastic and thorough article – have you done a similar analysis on root domains only, or perhaps is it something you might consider?

    • Alexander Albuquerque says:

      Thanks for your feedback. Can’t possibly try any harder though ;)

      We do mention root domains here and use it as a ratio but this particular analysis was dedicated to links. We do have the data to produce something similar for root domains but it would never be to the same level of depth… as the amount of variables that could go wrong (or right) when crawling root domains isn’t huge. We have total root domains per provider in this analysis… are you referring to uniqueness/commonality between providers?

  9. It is really an interesting and more attractive post but sure it is not complete but a great way to discuss with such tools. I really appreciate your efforts to do this comparison and looking forward for the next part of this post with the best detail same like this one.

  10. Amazing information you have shared, and you must have worked hard to gather such useful information. It is very interesting to read that Majestic SEO has the highest total number of detected URLs compared to Ahrefs and Moz according to the table that you have shared, and I am thinking that the strategy of these tools will change. Thanks for sharing this awesome information.

