Link: Matt Cutts: Gadgets, Google, and SEO Indexing timeline.
OK, so I came late to the party because I've been focusing in the past few years on content instead of technology. I usually manage to stay current enough to keep up, but after noticing my pages falling out of Google's indexes, as described in my two previous posts, I started doing some research on the topic tonight. I found a ton of great information in the tech blogs, where I hadn't been lurking in quite a while.
Some of the links I've been reading are over in my Intriguing list on the side menu, and I may add more in here later, but here's what I gleaned in a nutshell:
Google launched a big new data hoover called "Big Daddy." Big Daddy rolled out slowly at the first of the year, but was fully deployed by March 2006. Big Daddy plays a bit differently with the Web than the old daddy did, I guess.
Many folks who live in this odd netherworld called "Search Engine Optimization," or SEO, do nothing but obsess on making Google like them. I do understand this obsession, although I do not share it. I've always seen people who promise SEO magic as rip-off artists who sell unnecessary services to gullible business owners wanting to hang out a web shingle. I usually counsel those folks to forgo SEO fees and just do a few things consistently, like build a good site and have things of value to say, and keep their money for something worthwhile. If SEO is rocket science, then the tail is wagging the dog, and something is very wrong in the world.
However, the SEO-obsessors (like those who aspired to be the most popular kids in high school) did teach me something valuable tonight. I too have seen the Google sandbox in action, and all this time I've been puzzled by the odd behavior of Google. Sandbox is a good term. Now I know what to call it. It's a holding place where baby sites, new sites, are put until Google can decide if they are legit, have staying power, or are link farms and spam.
I've been building web sites since 1993, and the rules have changed so many times since then, I like to focus on the constants, the values that remain even as the game changes. So yeah, I played my own personal SEO game. I've never made a junk site, but I'm well-read enough to know there are a lot of things that make viable web sites that those chasing the momentary fad are unaware of. I try to focus on those things, put them up on sites, and observe what the search engines do. Like my dissertation. Like the Xenaverse itself. Like the blogosphere, as it shifted from tech blogs to security blogs to political blogs and citizen journalism efforts.
Which doesn't mean I'm not playing with interactive poetry, creative hyper-non-fiction, image-driven sites, or developing all manner of things behind my firewalls.
So with the latest sites I've launched, I noticed the Google Sandbox. Where Google picked up my sites right off in the past, suddenly new sites on the same domain are being virtually ignored, despite being timely, pulling comments from some big name bloggers, and containing substantive content with specific proper noun keywords. Because I've got better things to do than obsess on SEO, I just sort of said, "Hmph," and went on my way. Now it's starting to make sense. Not that I agree with the Sandbox, but at least I have a name for what I've been seeing.
But that still doesn't explain why back archive sites that have inbound links from all kinds of interesting neighborhoods, comments that enlighten, correct, and debate, and traffic, are disappearing into a Google Black Hole.
Matt Cutts is a pretty cool Google guy, and he really tries to explain all this stuff, but one of the commenters on this blog made the point so well, I just have to quote it here (it's most interesting to me that the commenter I like best comes from a background in online fandom cultures):
Nancy Said,
Matt,
First, I appreciate you maintaining this blog and responding to some of the comments.
I realize you can’t analyze every site, but from what I’ve seen at Webmaster World, the sites you have picked are not very representative of the sites which are having problems with the supplemental index and not being crawled. The sites you have picked are obvious offenders, but sites such as my own and many others have none of these issues. To us, it seems that building a site to the best of one’s ability isn’t good enough; unless you can play the Google game, you’re out of luck. For instance, the inbound link issue. There are only a couple active fansites related to mine (most are no longer updated, and my site is only a few months old). Therefore, I am stuck with a couple inbound links unless I try to contrive inbound links, which I have no desire to do. Of course, the related sites also naturally link back to me - I’m related to them too, after all! Now that’s bad? It’s quite a Catch-22.
I think one should hesitate to imply that all the websites with supplemental problems “deserve it” because they’re all doing something so terribly wrong that they no longer are recognized by the index. There are many sites which do not fit into this penalty schema that have lost pages - too many to blow off as aberrations in an otherwise successful change.
But frankly I am more concerned with the fact that so many pages with good content are being ignored. If I were #105 for my keywords but could look at site:[my site] and see that my pages are indexed, I would be OK with that. At least they’re there, and people who are looking for content unique to my site can find it. However, now, according to Google, only 7 pages on my site are searchable for the average Google user - only seven pages of my site exist in Googleland. I can put exact phrases from supplementally indexed pages in the search engine and get no results returned. With almost nothing indexed, I feel like all my honest efforts are worthless to Google for some mysterious reason.
Yes, it’s your search engine and you may do what you like. However, I’m sure you understand that a search engine that throws out good content is not doing its job. Hopefully, you will not shrug off the numerous legitimate concerns because you were able to find in the vast array of e-mails you received some egregious offenders.
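For anyone who wants to run the same sanity check Nancy describes, here's a minimal sketch of building the `site:` query (optionally with an exact quoted phrase from one of your pages) that you'd paste into Google to see which of your pages are actually searchable. The domain and phrase below are hypothetical placeholders, not anyone's real site.

```python
# Sketch of Nancy's index check: a "site:" query, optionally narrowed
# to an exact phrase, shows which of a site's pages Google will serve
# to the average user. Domain and phrase here are hypothetical.
from urllib.parse import urlencode


def site_query_url(domain, phrase=None):
    """Return a Google search URL for site:domain, optionally
    restricted to an exact quoted phrase from one of your pages."""
    query = "site:" + domain
    if phrase:
        query += ' "' + phrase + '"'
    return "https://www.google.com/search?" + urlencode({"q": query})


# How many of my pages are indexed at all?
print(site_query_url("example.com"))
# Is this specific page's content findable?
print(site_query_url("example.com", "an exact phrase from my page"))
```

If the phrase search returns nothing even though the page is on your site, you're seeing exactly what Nancy reports: content crawled but effectively invisible in the public index.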
So as I was writing this post, I had another thought. Google is at a choke point of power, literally shaping the virtual landscape of cyberspace by holding the Keys to the Kingdom.
This is clearly a bad thing for those of us who need Google to be a reliable source of the Library of Everything, who want indiscriminate indexing and link parsing like Nancy above, who want an uncorrupted data set delivered in the results. But if Google's data set has become corrupted with the advent of Big Daddy, perhaps it means people will be driven to more reliable search engines, and the stranglehold on power Google has will finally be broken.
On the other hand, I am starting to suspect that two parallel universes are in operation here: Google crawling and Google indexing. Google's crawling bots are still utter Hoovers, sucking up everything in their path. Reading the tech blogs convinced me of that. SEO folks are in the habit of checking their crawls, and they'd be screaming even more to high heaven if Google were not spidering the Known World in its Entirety, even Gmail content and the like.
But the Google indexes appear to be operating from a different data set, or different assumptions within the exclusive and secret Google data set. Is there one set of data that Google uses, parses, and fine-tunes into secret sauce for its own business purposes, just as the NSA would find "uses" for the data it sucks up indiscriminately?
And just like with China, is Google creating a separate and unequal parallel universe of indexing data that it allows the public, APIs, and other groups to see, a non-proprietary Google data set?
Who gets to search the REAL and uncorrupted Google data? It seems the rest of us get Big Daddy, but is he sloppy seconds? That is, those of us who aren't consigned to the Sandbox, or worse, to the bottomless pit with much weeping and gnashing of teeth?
I'm just speculating, of course. Who can see into the secret heart of Google? I don't want Google to be evil any more than the next person does. And I'm not saying for certain that it is.
This poster on the same site also made an interesting point, as I've been victimized by these "scraper sites" as well, but playing copyright police won't stop it (they could just shift the material to a mirror site as fast as blog comment spammers change IP addresses), and hiding the sites by erasing them from the Google index only makes it harder for legit sites to find and protest the plagiarism.
Think of it in terms of college term paper theft. You put your paper online because your teacher requires it. The paper gets hoovered up by some paper mill sites that start selling it to cheaters for a profit. You could try to stop them, but Google blacks them out of the universe, so you don't even know you're being ripped off.
I'll say it again: the problem is giving Google the Keys to the Kingdom in the first place, because it apparently can literally erase something from cyberspace and render it invisible. Add to that collusion with countries like China (I bet there are politicians in the US who drool over the power Google is allowing the Chinese government), and the telcos' bid to turn the Internet into a giant toll road for their cronies, and can fascism be far behind?
Jeff Said,
May 16, 2006 @ 6:36 pm
Matt,
I know Google is not giving us webmasters a full picture with the link command. I did the link command on Yahoo and MSN and noticed that some scraper sites copied my content and added some links to a few of my websites. I have a feeling Google is looking at these links as questionable. I am in the process of emailing these scraper sites' webmasters and getting the links removed, because I did not request to put them there and they violated copyright by taking our content.
Since Google crawls better than MSN and Yahoo, will there be a way in the future for us webmasters to see these links? Honestly, right now if a competitor wants to silently tank a website's rankings in Google, all they need to do is drop a bunch of bad links. Without Google giving us webmasters the ability to see the links, we may never even know this could happen.