We here at milMedia Group were doing some repair work on a site with an area that was believed to be protected but wasn’t, and sensitive data had been crawled by Google and was showing up in search results. Getting the document offline is the simple part, but Google has a memory: its cache and Quick View features can leave your document available for viewing for days or weeks before Google crawls your site again and cleans up the links. Here are some tips that may help you prevent that type of problem, or help you pull the document back in a timely manner once it has happened.
First off, one of the challenges of establishing a new website is getting the word out as fast as possible. I usually find myself submitting new sites to a variety of crawlers, spiders and search engines to improve the likelihood of them being found, and of course, none is quite as big as Google. So ensuring the right placement using proper search engine optimization (SEO) tactics is usually high on my list, while restricting crawlers and bots is not always high on a new site owner’s list. And if you start off small with a few pages and then grow, you have to keep assessing what you need to protect, and how. In this case, a misconfigured module restricted access with a login/password combination but allowed a directory to be crawled without requiring credentials. That is where our triage and actions began.
Deleting a Cached Google link
Foresight being what it is, one way to keep your pages out of the Google cache is to use the NOARCHIVE meta tag. You can read this Lifehacker blog post for more details on that feature, but if your pages change frequently or if your content is something you do not want cacheable, this is the tag for you. Just add it before the closing </head> tag on any web page that you never want cached.
<META NAME="ROBOTS" CONTENT="NOARCHIVE">
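If you want to confirm that the tag actually made it onto a page, a quick script can check for it. This is only a minimal Python sketch using the standard library; the function name has_noarchive and the sample page are made up for illustration and are not part of any Google tool.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            # The content attribute may hold several comma-separated directives.
            for item in attr_map.get("content", "").split(","):
                self.directives.add(item.strip().lower())


def has_noarchive(html_text):
    """Return True if the page carries a robots meta tag with NOARCHIVE."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return "noarchive" in parser.directives


# Hypothetical page used only to demonstrate the check.
sample_page = """<html><head>
<title>Example</title>
<META NAME="ROBOTS" CONTENT="NOARCHIVE">
</head><body>Hello</body></html>"""

print(has_noarchive(sample_page))  # prints True
```

In practice you would feed it the HTML of your own pages instead of the sample string.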
That meta tag will keep your documents out of the cache on some search sites like Google, but what if you chose not to use the tag and are now reacting to an incident like the one we were faced with? The instructions here are for site owners and webmasters; if you need sensitive information removed from a site that you do not own, the procedures are a bit different.
Before you do any of these steps, at least for Google, you need to remove the document or file from your site, and the offending link must return a 404 error before Google will act on a request. A “nuclear” option would be taking down the entire site; a better response might be removing the offending directory, and even that might be extreme for your situation. Another method is creating or editing your robots.txt file, a special set of instructions that guides web crawlers and bots like Google’s. A robots.txt file can restrict a path, file or directory on your site, and is something you should have anyway. If you’ve updated a page on your site and want Google to remove the outdated cached version from search results on its next scan, you can add a rule to that file that disallows the page’s path.
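As a rough illustration of what such a robots.txt might contain, and a way to sanity-check it before uploading, here is a small Python sketch. The paths /private/ and /old-page.html and the example.com URLs are hypothetical, and Python’s built-in parser only approximates how a real crawler such as Googlebot reads the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents: block one directory and one page
# for every crawler. Swap in the paths that apply to your own site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /old-page.html
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A file inside the blocked directory should be refused...
print(is_allowed(ROBOTS_TXT, "Googlebot", "https://www.example.com/private/report.doc"))
# ...while the rest of the site stays crawlable.
print(is_allowed(ROBOTS_TXT, "Googlebot", "https://www.example.com/index.html"))
```

The first call should report that the crawler is not allowed and the second that it is. Keep in mind that robots.txt is only a request to well-behaved crawlers, so it does not replace the login protection itself.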
Those techniques work well for page updates you just want cleaned up, where time isn’t as critical as it would be with personal or sensitive data. For cases where speed is important, Google has a URL removal tool that lets site owners reach into the Googleplex and possibly influence the search results. Google says this is not for everyday changes, but that “The URL removal tool is designed for URLs that urgently need to be removed, such as URLs that accidentally expose confidential data.” I do not think they are going to respond the same way for lesser offenses, and your results may vary.
To use it, log in to your Google account and verify your ownership of the site in Google’s Webmaster Tools. If you don’t have an account, now is the time to set one up; it is a tool all site owners should have in their kitbag anyway. Once you are on the Webmaster Tools home page, either enter your site URL or click the site URL if you already have one established. Click Site configuration in the left navigation of your dashboard, then click the Crawler access link. From there click Remove URL, then New Removal Request, and type the URL of the page you want removed.
According to Google “the URL is case-sensitive—you will need to submit the URL using exactly the same characters and the same capitalization that the site uses.” After that, you select Remove page from cache only and verify you have completed the requirements I mentioned above, then click Submit Request. Now from your panel, you can monitor the status of your request, something you need to continue to check until it’s resolved.
Since the details involved a person other than the site maintainer, we also assisted that person in reporting the offending link to Google as someone who does not own the site. This way we had two separate requests on the same link. For the first few hours, the link was still visible, even after multiple browser refreshes and cache cleanings. To Google’s credit, within 11 hours of reporting, the offending link could no longer be found. In this case, it wasn’t so much a cached version; since it was a .doc-type file, the offending Google feature was the Quick View cache. One important note: to find the offending document you had to type a specific combination of keywords, which brought the document to the top of the search results, something one might not be so lucky with in the future.
Steps were taken to protect the sensitive data, and since then to make security redundant, but one cannot be too careful. I personally will reconsider the deliberate decision to exclude the NOARCHIVE meta tag, a simple way to keep pages from ever being archived. But without a doubt, besides the obvious step of validating the configuration, the next layer that would have prevented this is a solid robots.txt file that excluded the sensitive directories. Feel free to add comments below on how you may have handled this or similar situations.
–Dan Elder, milMedia Group