SEO Chapter 2: The Importance of Good Site Architecture

Before you start examining a website from this level, let me explain the importance of good site architecture.

While writing this book I am working with a large client that is totally befuddled by its poor rankings. (Note: This client had me sign a nasty-looking non-disclosure agreement, so I am unable to reveal its name.) The company's homepage is literally one of the most linked-to pages on the entire Internet and at one point had the elusive PageRank 10. One of its current strategies is to leverage its homepage's link popularity to bolster a large group of pages optimized for ultra-competitive keywords. It wants to cast a wide net with the optimized pages and drive a large amount of search engine-referred traffic to its product pages.

It is a great idea, but with the current execution, it has no chance of working.


The problem is that the website lacks any kind of traditional site architecture. The link juice (ranking power) coming from the hundreds of thousands of domains that link to this company's homepage has no way of traveling to the other webpages on this domain. All of the link juice is essentially bottled up at the front door.

Its content is located on at least 20 different domains, and there is no global navigation that leads users or search engines from the homepage down to categorized pages. The company's online presence is more like a thousand islands than the supercontinent it could be. It is an enormous waste of resources and is directly affecting the company's bottom line in a real way.

When explaining site architecture to clients, I start out by asking them to visualize a website like an ant hill. All of the chambers are like webpages and the tunnels are like internal links. I then have them imagine a little boy pouring water into the ant hill. He pours it down the main entrance and wants to have it fill all of the chambers. (As a side note, scientists actually have done this with cement to study the structure of ant metropolises. In one case, they had to pour 10 tons of liquid cement into an ant hill before it filled all of the chambers.) In this analogy the water represents the flow of link juice to webpages. As discussed earlier, this link juice (popularity) is essential for rankings.

The optimal structure for a website (or ant hill, if you must) would look similar to a pyramid.

This structure allows the most possible juice to get to all of the website's pages with the fewest number of links. This means that every page on the website gets some ranking benefit from the homepage.



A pyramid structure for a website allows the most possible link juice to get to all the website's pages with the fewest number of links.

NOTE Homepages are almost always the most linked-to pages on a domain. This is because they are the most convenient (the shortest) URL to link to when referring to the website online.

Evaluating Homepages

Now that we are on the same page about site architecture, we can move forward. Once I get to this level of analysis, I start looking closely at the architecture itself. Obviously, this starts at the homepage.

Ideally, the homepage should link to every single category of pages on a website. Normally, this is accomplished with a global navigation menu (global meaning it is on every web page on the domain). This is easy to do with small websites because if they have fewer than 150 pages, the homepage could directly link to all of them. (Note this is only a good idea if the homepage has enough links pointing at it to warrant this. Remember the little boy and the ant hill; link popularity is analogous to the amount of water the little boy has. If he doesn't have enough, he can't fill every chamber.)

SEO Chapter 2: Robots.txt and Sitemap.xml

After analyzing the domain name, general design, and URL format, my colleagues and I look at a potential client's robots.txt and sitemap. This is helpful because it starts to give you an idea of how much (or little) the developers of the site cared about SEO. A robots.txt file is a very basic step webmasters can take to work with search engines. The text file, which should be located in the root directory of the website (http://www.example.com/robots.txt), is based on an informal protocol used for telling search engines which directories and files they are allowed and disallowed from accessing. The inclusion of this file gives you a rough hint of whether or not the developers of the given site made SEO a priority.
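To make this concrete, here is a minimal sketch of a robots.txt file; the disallowed directories are hypothetical examples, not a recommendation for any particular site:

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /admin/

    Sitemap: http://www.example.com/sitemap.xml

The optional Sitemap line is an extension of the original protocol that points crawlers at the sitemap file discussed next.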

Because this is a book for advanced SEOs, I will not go into this protocol in detail. (If you want more information, check out http://www.robotstxt.org or http://googlewebmastercentral.blogspot.com/2008/06/improving-on-robots-exclusion-protocol.html.) Instead, I will tell you a cautionary tale.

Bit.ly is a very popular URL shortening service. Due to its connections with Twitter.com, it is quickly becoming one of the most linked-to websites on the Web. One reason for this is its flexibility. It has a feature where users can pick their own URL. For example, when linking to my website I might choose http://bit.ly/SexyMustache. Unfortunately, Bit.ly forgot to block certain URLs, and someone was able to create a shortened URL for http://bit.ly/robots.txt. This opened up the possibility for that person to control how robots were allowed to crawl Bit.ly. Oops! This is a great example of why knowing even the basics of SEO is essential for web-based business owners.

After taking a quick glance at the robots.txt file, SEO professionals tend to look at the default location for a sitemap (http://www.example.com/sitemap.xml). When I do this, I don't spend a lot of time analyzing it (that comes later, if the owners of that website become a client); instead, I skim through it to see if I can glean any information about the setup of the site. A lot of times, it will quickly show me if the website has information hierarchy issues. Specifically, I am looking for how the URLs relate to each other. A good example of information hierarchy would be www.example.com/mammal/dogs/english-springer-spaniel.html, whereas a bad example would be www.example.com/node?type=6&kind=7. Notice that in the bad example the search engines can't extract any semantic value from the URL. The sitemap can give you a quick idea of the URL formation of the website.

URLs like this one are a sign a website has information hierarchy issues because search engines can't extract any semantic value from the URL.
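For reference, a bare-bones sitemap following the sitemaps.org protocol might look like the following; the URLs are the hypothetical examples from above:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://www.example.com/</loc>
      </url>
      <url>
        <loc>http://www.example.com/mammal/dogs/english-springer-spaniel.html</loc>
      </url>
    </urlset>

Skimming the <loc> entries in a file like this is usually the fastest way to see whether a site's URLs express a hierarchy or are opaque query strings.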

Action Checklist

When viewing a website from the 100-foot level, be sure to take the following actions:

• Decide if the domain name is appropriate for the given site based on the criteria outlined in this chapter.

• Based on your initial reaction, decide if the graphical design of the website is appropriate.

• Check for the common canonicalization errors.

• Check to see if a robots.txt file exists, and get an idea of how important SEO was to the website developers.

• If inclined, check to see if a sitemap.xml file exists, and if it does, skim through it to get an idea of how the search engines might see the hierarchy of the website (a quick sketch for automating these last two checks follows this list).
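If you want to automate the last two checks, a quick sketch in Python might look like the following; it uses only the standard library, and the domain is a placeholder:

    import urllib.error
    import urllib.request

    def check(url):
        # Request the URL and report the HTTP status code.
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                print(url, "->", resp.status)
        except urllib.error.HTTPError as e:
            print(url, "->", e.code)  # for example, 404 if the file is missing

    for path in ("/robots.txt", "/sitemap.xml"):
        check("http://www.example.com" + path)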

This section dealt with the first elements I examine when looking at a client's site from an SEO perspective: domain name, design, canonicalization, robots.txt, and sitemaps. This initial look is intended to be just a high-level viewing of the site.

In the next section I focus on specific webpages on websites and take you even closer to piecing the SEO puzzle together.

SEO Chapter 2: Duplication and Canonicalization

After analyzing a website's domain name and general design, my colleagues and I check for one of the most common SEO mistakes on the Internet: canonicalization. For SEOs, canonicalization refers to individual webpages that can be loaded from multiple URLs.

NOTE In this discussion, "canonicalization" simply refers to the concept of picking an authoritative version of a URL and propagating its usage, as opposed to using other variants of that URL. On the other hand, the book discusses the specific canonical link element in several places, including in Chapter 5.

Remember that in Chapter 1 I discussed popularity? (Come on, it hasn't been that long.) What do you think happens when links that are intended to go to the same page get split up among multiple URLs? You guessed it: the popularity of the pages gets split up. Unfortunately for web developers, this happens far too often because the default settings for web servers create this problem. The following lists show the duplicate URLs created by the default settings on the two most common web servers:

Apache web server:

• http://www.example.com/

• http://www.example.com/index.html

• http://example.com/

• http://example.com/index.html

Microsoft Internet Information Services (IIS):

• http://www.example.com/

• http://www.example.com/default.asp (or .aspx depending on the version)

• http://example.com/

• http://example.com/default.asp (or .aspx)

Or any combination with different capitalization.

Each of these URLs spreads out the value of inbound links to the homepage. This means that if the homepage has 100 links to these various URLs, the major search engines only give them credit separately, not in a combined manner.

NOTE Don't think it can happen to you? Go to http://www.mattcutts.com and wait for the page to load. Now, go to http://mattcutts.com and notice what happens. Look at that, canonicalization issues. What's the significance of this example? Matt Cutts is the head of Google's web spam team and helped write many of the algorithms we SEOs study. If he is making this mistake, odds are your less informed clients are as well.

Luckily for SEOs, web developers have created methods for redirection so that URLs can be changed and combined. Two primary types of server redirect exist:

• A 301 indicates an HTTP status code of "Moved Permanently."

• A 302 indicates a status code of "Moved Temporarily" (renamed "Found" in HTTP/1.1).

Other redirect methods exist, such as the meta refresh and various JavaScript relocation commands. Avoid these methods. Not only do they not pass any authority from origin to destination, but engines are unreliable about following the redirect path.
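For illustration, here is a minimal sketch of how the duplicate Apache URLs listed earlier might be consolidated with 301 redirects in an .htaccess file; it assumes mod_rewrite is enabled, and example.com is a placeholder:

    RewriteEngine On

    # Send the bare domain to the www hostname with a permanent (301) redirect.
    RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
    RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

    # Collapse /index.html onto the root URL.
    RewriteRule ^index\.html$ http://www.example.com/ [R=301,L]

A browser follows these redirects just as it would a meta refresh, but the 301 status code also tells search engines to treat the destination as the authoritative URL.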

TIP You can read all of the HTTP status codes at http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html.

Though the difference between 301 and 302 redirects appears to be merely semantics, the actual results are dramatic. Google decided a long time ago to not pass link juice (ranking power) equally between normal links and server redirects. At SEOmoz, I did a considerable amount of testing around this subject and have concluded that 301 redirects pass between 90 percent and 99 percent of their value, whereas 302 redirects pass almost no value at all. Because of this, my co-workers and I always look to see how non-canonicalized pages are being redirected.

It's not just semantics. How a page is redirected (whether by a 301 or a 302 redirect) matters.
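One quick way to see how a page is being redirected is to request the non-canonical URL without following redirects and read the status code. Here is a sketch in Python, again standard library only with a placeholder domain:

    import urllib.error
    import urllib.request

    class NoRedirect(urllib.request.HTTPRedirectHandler):
        # Returning None makes the opener raise instead of following the redirect.
        def redirect_request(self, req, fp, code, msg, headers, newurl):
            return None

    opener = urllib.request.build_opener(NoRedirect)
    try:
        resp = opener.open("http://example.com/", timeout=10)
        print(resp.status)  # 200 means no redirect at all
    except urllib.error.HTTPError as e:
        # A 301 here passes most link value; a 302 passes almost none.
        print(e.code, e.headers.get("Location"))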

WARNING Older versions of IIS use 302 redirects by default. D'oh! Be sure to look out for this. You can see worthless redirects all around popular IIS-powered websites like microsoft.com and myspace.com. The value of these redirects is being completely negated by a single value difference!


Canonicalization is not limited to the inclusion of letters. It also dictates forward slashes in URLs. Try going to http://www.google.com and notice that you will automatically get redirected to http://www.google.com/ (notice the trailing forward slash). This happens because technically this is the correct format for the URL. Although this is a problem that is largely solved by the search engines already (they know that www.google.com is intended to mean the same as www.google.com/), it is still worth noting because many servers will automatically 301 redirect from the version without the trailing slash to the correct version. By doing this, a link pointing to the version without the trailing slash loses between 1 percent and 10 percent of its worth due to the 301 redirect. The takeaway here is that whenever possible, it is better to link to the version with the trailing slash. There is no reason to lose sleep over this (because the engines have mostly solved the problem), but it is still a point to consider.

CROSSREF The right and wrong usage of 301 and 302 redirects is discussed in Chapter 3. The correct syntax and usage of the canonical link element is discussed in Chapter 5.