A standard way to search Google for pages within a specific site is to enter ‘site:www.yoursite.com’ as the search query. This returns only results from the site you specified. Omitting any additional search terms returns all of the site's indexed pages as your result. In theory, this should give you an idea of how many pages Google has indexed on your site.
How Many Pages?
Achk, spit – you're kidding me, 47,400 results!
47,400 results astounded me. I knew this could not be right, so I started looking through the results. It turns out many of them were files that were once indexed but are now blocked by a robots.txt file and should drop off the Google search results in due time. A majority of those files were compressed and resized images that Google crawled. Other results were paged items, such as the prayer wall, that simply do not need to be indexed for search; robots.txt to the rescue once again.
The simple Google site search revealed very important information that I would not have been able to find easily otherwise. I was completely surprised by how many pages I found indexed on Google. I will be doing this more often.
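To see which of the indexed URLs the robots.txt rules now cover, a quick check can be scripted. This is only a minimal sketch using Python's standard `urllib.robotparser`; the rules, the paths, and the sample URLs are hypothetical stand-ins, since the site's real robots.txt is not shown here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration; the real robots.txt will differ.
ROBOTS_TXT = """\
User-agent: *
Disallow: /images/resized/
Disallow: /prayer-wall/page/
"""


def blocked_urls(robots_txt, urls, user_agent="Googlebot"):
    """Return the URLs that the given robots.txt text disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Hypothetical URLs of the kind that showed up in the site: search.
sample = [
    "https://www.yoursite.com/about",
    "https://www.yoursite.com/images/resized/banner_200x100.jpg",
    "https://www.yoursite.com/prayer-wall/page/12",
]

for url in blocked_urls(ROBOTS_TXT, sample):
    print("blocked:", url)
```

Feeding it the URLs copied from the search results gives a rough idea of how many of them should eventually drop out.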
Google Webmaster Tools
Since I was unable to get a solid result the “quick” way, I decided to log in to Google Webmaster Tools and see what information it has on indexed pages.
Part of our SEO strategy is to submit an XML sitemap file to Google to help with crawling and discovery. The XML sitemap is custom generated using only specific areas of our site, including most areas that we want indexed in Google searches, like content pages, blogs, events, etc.
The results I see in Webmaster Tools show that we are submitting 573 unique pages for indexing on Google. While this number is closer to what I expected, it still does not seem accurate, as there are quite a few more pages that I am positive we use. Most are dynamic and personal in nature, such as user profiles or job applications, but others, like events, should be included in our total page count.
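The submitted count can also be cross-checked locally by counting the unique URLs in the sitemap file itself. The following is a rough sketch, assuming the sitemap follows the standard sitemaps.org format; the sample XML and its URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def unique_sitemap_urls(sitemap_xml):
    """Return the set of unique <loc> URLs found in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return {
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    }


# Hypothetical sitemap content; a real one would be read from the generated file.
SAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.yoursite.com/</loc></url>
  <url><loc>https://www.yoursite.com/events</loc></url>
  <url><loc>https://www.yoursite.com/events</loc></url>
</urlset>
"""

print(len(unique_sitemap_urls(SAMPLE_SITEMAP)), "unique URLs in sitemap")
```

If this number and the one in Webmaster Tools disagree, the sitemap generator is the first place to look.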
Google Analytics
In my attempt to collect more information about how many pages my site contains, I turned to another Google product, Google Analytics. If you do not currently use an analytics tool for your website, do yourself, or your clients, a favor and set one up. It is super quick and easy, and Google Analytics is even free.
After logging in, I set my date period to the last 6 months of data, then drilled down to “Behavior > Site Content > All Pages”, which displays a list of all accessed pages. I then set the number of rows shown to the max of 5000, using the control found at the end of the table of page URLs.
One handy feature of all the data tables within Google Analytics is the ability to search the results. This allows you to display only results that match the URLs you are looking for. An even nicer tool is the ability to include or exclude content with a filter. You can do this by clicking the ‘advanced’ link next to the table’s search box.
Adding a filter is easy: select your dimension(s) and whether you would like to include or exclude them in your results. In the example, I knew that I didn’t want any paged prayer pages to show.
After excluding all of the URLs I did not want to count as pages, I got a number that I feel is close: 794.
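The same include/exclude filtering can be repeated offline on an exported list of page paths. This is a sketch under a few assumptions that are not from the Analytics setup itself: the exclude pattern is a hypothetical one for paged prayer wall URLs, and query strings and trailing slashes are treated as the same page.

```python
import re
from urllib.parse import urlsplit

# Hypothetical exclude pattern; adjust to whatever should not count as a page.
EXCLUDE_PATTERNS = [re.compile(r"^/prayer-wall/page/\d+")]


def count_pages(paths, exclude_patterns=EXCLUDE_PATTERNS):
    """Count unique page paths, skipping any that match an exclude pattern."""
    pages = set()
    for raw in paths:
        # Assumption: ignore query strings and trailing slashes when comparing.
        path = urlsplit(raw).path.rstrip("/") or "/"
        if any(pattern.search(path) for pattern in exclude_patterns):
            continue
        pages.add(path)
    return len(pages)


# Hypothetical paths of the kind listed in the All Pages report.
exported = [
    "/",
    "/about",
    "/about/",
    "/about?ref=home",
    "/prayer-wall",
    "/prayer-wall/page/2",
    "/prayer-wall/page/3",
]

print(count_pages(exported), "pages after filtering")
```

Whether variants such as `/about` and `/about?ref=home` really are one page depends on the site, so the normalization step is worth adjusting before trusting the total.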
404 Issues
I did find a few other interesting things while looking at the Analytics data. I have a handful of pages that load the header and footer of my site but load no content. This led me to find out that some of my 404 page structures are incorrect; they should be verified as working properly when a user accesses a page that doesn’t exist.
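Pages like these can be flagged with a simple heuristic: a response that claims success but has an empty content area is probably a missing page in disguise. The sketch below assumes the site's main content lives inside a `<main>` element, which is a guess about the template and not something confirmed here; the status code and HTML would come from whatever tool fetches the page.

```python
from html.parser import HTMLParser


class MainTextExtractor(HTMLParser):
    """Collect the text found inside <main> elements (assumed content area)."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "main":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "main" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def is_soft_404(status_code, html):
    """True when a page returns 200 but its <main> area holds no text."""
    if status_code != 200:
        return False  # a real 404 (or other error) is already reported properly
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return not "".join(parser.chunks).strip()


# Hypothetical page that renders only the header and footer.
shell_page = "<header>Site</header><main>  </main><footer>Contact</footer>"
print(is_soft_404(200, shell_page))
```

A page with no `<main>` element at all is also flagged, so the tag name needs to match the real template before the results mean anything.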
The process of finding out how many pages existed on my site turned out to be a much longer adventure than I originally planned, but a more valuable one, and worth every minute I spent on it.