Is Google Broken?
3 September 2004
Between Aug 04, 2003 and Aug 25, 2003 (just 21 days), Google added a little over 1.2 billion Web pages to their index. But between Aug 25, 2003 and today, Google hasn't added a single Web page to their index (at least according to Google's own count).
Today, Google's home page states:
2004 Google - Searching 4,285,199,774 web pages
Now let's look at the history of Google's home page using "web.archive.org" (they archive Web pages, which you can review online as they were at the time):
Aug 25, 2003
2003 Google - Searching 4,285,199,774 web pages
Now isn't this strange? The exact same number of Web pages one year ago as today, and next week it will still be the same. Let's go back and see when this number was different:
Aug 04, 2003
2003 Google - Searching 3,083,324,652 web pages
So what does this mean? It means either Google is lying to us all or they have been dropping as many pages as they have been adding them.
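The arithmetic behind the "1.2 billion pages in 21 days" figure can be checked directly from the two home page counts quoted above. This is only a quick sketch; the variable names are mine and the numbers are the ones shown on Google's archived home pages.

```python
from datetime import date

# Page counts as displayed on Google's home page (quoted above)
count_aug_04 = 3_083_324_652
count_aug_25 = 4_285_199_774

# Growth between the two archived snapshots
added = count_aug_25 - count_aug_04
days = (date(2003, 8, 25) - date(2003, 8, 4)).days
per_day = added / days

print(f"{added:,} pages added in {days} days")
print(f"about {per_day:,.0f} pages per day")
```

That works out to 1,201,875,122 pages in 21 days, or roughly 57 million pages a day, followed by a full year with no change at all in the displayed count.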
My guess is that on Aug 25, 2003 Google's index was full. Why do I say this? Because Google's white papers were freely available to anyone. This meant that you could access the actual documents published by Google's founders before Google became public and get a glimpse of how Google was created. According to these documents, Google was written in C and C++ and runs on Linux. The database was constructed using a document_ID that is associated with each Web page. This document_ID was published as being a 4-byte unsigned long integer. This means that for every single Web page Google has in their index, an ID was created to identify this Web page. But like everything, there is a limit: a 4-byte unsigned long integer can hold only 4,294,967,296 distinct values (0 through 4,294,967,295). So if no changes have been made to their database structure, Google has probably reached this threshold, and as new pages are added, old pages are removed (disappear). Quite alarming, isn't it?
So Google may have a serious flaw in their database structure and design. Google has used a 4-byte unsigned long integer to store the document ID (one for every page in Google's index). In Linux (which is what Google uses), this variable is 4 bytes long and can represent about 4.29 billion values (4,294,967,296) before it rolls over to zero. This may also be one of the reasons pages appear to be dropping from Google's index at an alarming rate (I can point to tens of thousands of search results where this is happening). They may have already run out of space, so that the document_ID is no longer associated with the content stored in the database, which in turn will return empty results for a particular URL.
Can this problem be corrected? Sure it can, but Google has 15,000+ Linux servers and roughly 4.3 billion document_IDs to convert. This is not going to be an easy task at this point. Also, every single word in their inverted index is associated with a document ID, so the conversion will probably take months, if not a great deal longer.
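To make the 4-byte limit concrete, here is a minimal sketch of how an unsigned 32-bit counter behaves. It is an illustration only: the function name next_doc_id is hypothetical, and the bit mask stands in for what a 4-byte unsigned integer does in C. It is not taken from Google's code.

```python
UINT32_VALUES = 2 ** 32          # number of distinct 4-byte unsigned values
UINT32_MAX = UINT32_VALUES - 1   # largest value a 4-byte unsigned integer holds

def next_doc_id(doc_id):
    """Hypothetical ID allocator: add one, keeping only the low 32 bits,
    the way a 4-byte unsigned integer would in C."""
    return (doc_id + 1) & 0xFFFFFFFF

# The count shown on Google's home page
reported_pages = 4_285_199_774

# IDs left before the counter would wrap around
headroom = UINT32_VALUES - reported_pages

print(f"largest ID: {UINT32_MAX:,}")
print(f"IDs remaining: {headroom:,}")
print(f"ID after the largest one: {next_doc_id(UINT32_MAX)}")
```

Under that assumption, the reported count leaves fewer than 10 million unused IDs (9,767,522), and the ID after 4,294,967,295 is 0 again.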
In addition to this major problem, there are other major flaws with Google. One of these is with their PageRank algorithm.
According to a recent study, 75% of keyword searches on the Web are handled by Google. First off, let me say that while Google may indeed handle 75% of keyword searches, you also have to consider how many of these people looked elsewhere as well. Yahoo! claims an Internet reach of over 80%, so they too are handling these same requests, but probably delivering better, less biased results.
The fact that Google returns currently "popular" pages at the top of search results only proves that Google is unfairly penalizing newly created pages that are not yet "popular". While this statement may be an exaggeration, it does contain an alarming bit of truth. To find a Web page, many users go to Google (or another search engine using Google's index, like AOL) and issue a keyword query.
If users cannot find relevant pages after several different keyword queries, they are likely to give up and stop looking further. So a Web page not indexed by Google, or ranked poorly by Google (low PageRank, poor Google popularity), will not likely be viewed by many users. And because of this, by Google's own admission, it will never become popular.
While Google takes more than 100 different factors into account in determining the final ranking of a Web page, the core of their ranking algorithm is based on a metric called PageRank. PageRank is nothing more than a "link popularity" metric, where a page is considered more important if it is linked to by many other pages on the Web that Google also considers important (popular and already in Google's index). Google puts a page at the top of search results if it contains the keywords the searcher is looking for, or if those keywords are found in the anchor text of the pages linking to it. The more popular the links pointing to a Web page, the more popular that Web page will be. So the popular continue to get "more" popular, and the less fortunate ones that are new to the Web continue to be held back from this popularity game Google is playing.
It is important to understand the distinction between the "importance or quality" of a Web page and its "popularity". What do you want, I ask? Do you want to see the same old popular sites day in and day out, or would you like to see relevant, content-rich, newly discovered Web pages? As long as you continue to use Google you will be promoting this popularity game, and your competition will continue to rise above you, not to mention that you will be missing out on the new stuff.
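For readers who want to see what a "link popularity" metric looks like in practice, here is a small sketch of the PageRank idea as it is usually described in public: a page's score is fed by the scores of the pages linking to it. The four-page graph and the page names are made up for illustration, the damping factor of 0.85 is an assumed, commonly quoted value, and this is certainly not Google's production code.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of scores.

    graph maps each page to the list of pages it links to; every linked
    page must also appear as a key. damping=0.85 is an assumed value.
    """
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal scores
    for _ in range(iterations):
        # Every page gets a small baseline score regardless of links
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in graph.items():
            if outlinks:
                # A page splits its score evenly among the pages it links to
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score over all pages
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank

# Hypothetical mini-Web: everyone links to "hub", nobody links to "new"
graph = {
    "hub": ["a", "b"],
    "a": ["hub"],
    "b": ["hub"],
    "new": ["hub"],
}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(f"{page}: {ranks[page]:.4f}")
```

In this toy graph the heavily linked "hub" page ends up with the highest score, while "new", which has no incoming links, is stuck with the baseline score no matter what it contains.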
Since popular pages are repeatedly returned by Google as top results, they are also the easiest for users to discover, which increases their popularity even further. In contrast, a currently "unpopular" page is often not returned by Google, so few new links will be created to the page, keeping the page's ranking down. This "rich-get-richer" scheme can and does destroy the quality of search results.
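The feedback loop described above can be illustrated with a deliberately crude toy model. Everything in it is an assumption made for the sake of the example: the page names, the starting link counts, the idea that only the top two pages are ever shown, and the rule that new links go to the shown pages in proportion to the links they already have.

```python
def simulate_link_growth(links, rounds, new_links_per_round=100, shown=2):
    """Toy rich-get-richer model (all parameters are illustrative).

    Each round, only the `shown` most-linked pages appear in results, and
    all new links go to those pages in proportion to their current links.
    """
    links = dict(links)  # copy so the caller's data is left untouched
    for _ in range(rounds):
        # Pages the searcher actually sees this round
        top = sorted(links, key=links.get, reverse=True)[:shown]
        total = sum(links[p] for p in top)
        # Compute all gains first so the round uses a consistent snapshot
        gains = {p: new_links_per_round * links[p] / total for p in top}
        for page, gain in gains.items():
            links[page] += gain
    return links

start = {"old_popular": 50, "established": 20, "new_page": 1}
end = simulate_link_growth(start, rounds=10)
for page, count in end.items():
    print(f"{page}: {count:.1f} links")
```

In this model the page that starts outside the visible results never gains a single link, while the two pages already on top absorb every new one.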
PageRank is an unfortunate algorithm for both users and Web page authors. Useful information is being ignored by Google simply because a new page or site has not had a chance to get noticed and, under the PageRank algorithm, never will.
A recent article at motleyfool.com stated that 98% of Google's revenues come from their advertisers. This would mostly consist of AdWords and AdSense. But all it would take is for a firewall company, a virus protection company, AOL or Microsoft to simply create a Google ad blocker, and it would be the end of Google overnight. These companies, as well as Google, already provide pop-up and pop-under blockers, and writing a Google ad blocker would be even simpler.
I have months of research to prove my statements.
Just my two cents for today! Anthony Federico
Source: W3Reports