#1
Bemusing Prompter
"Danny"
Dec 2002
California
2³·13·23 Posts
A while ago, our robots.txt file blocked everything except Google, but now it is blocking Google again as well.
Was there a reason for this change? I feel that this is unfortunate, since we cannot get much publicity this way. Google does cache almost every page it crawls, but the cached pages do not last forever; the cached pages from this forum will probably be gone the next time Google tries to crawl us.
#2
"Mike"
Aug 2002
2020₁₆ Posts
When we give Google permission to index the forum, a bunch of other search engines come in and put a major strain on the server.
We've checked the robots.txt syntax multiple times for letting only Google in.
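For reference, the usual way to admit only Google's crawler in robots.txt is a sketch like the following (the forum's actual file may differ; under the Robots Exclusion Protocol, an empty `Disallow:` means "allow everything" for that agent):

```
# Let Google's crawler index everything.
User-agent: Googlebot
Disallow:

# Block every other (well-behaved) crawler from the whole site.
User-agent: *
Disallow: /
```

The catch, of course, is that robots.txt is purely advisory: a rogue spider can simply ignore it, which is why the strain shows up no matter how correct the syntax is.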
#3
∂²ω=0
Sep 2002
República de California
11639₁₀ Posts
Hey, Mike: on the following how-to-set-up-robots.txt page, http://www.thesitewizard.com/archive/robotstxt.shtml, I note this quote: "Listing something in your robots.txt is no guarantee that it will be excluded. If you really need to protect something, you should use a .htaccess file (if you are running your site on an Apache server)." Is that only relevant to directory names, or could it also apply to search-engine-specific inclusion/exclusion? [Of course if you don't use Apache it's moot.]
#4
"Mike"
Aug 2002
2⁵×257 Posts
Of course we run Apache.
The .htaccess stuff is, from what we understand from your link, just for password-protecting directories. (There are other .htaccess functions, like redirecting requests and blocking IP addresses, but we don't think those apply to this discussion.) For example, we use an .htaccess/.htpasswd setup on our administrative pages, so those never get indexed, but we can't think of a way to let Google through that kind of system, because it requires a password. (Plus, the users would need the password too!) If there is a way to block search engines this way, we are, like the Ferengi, all ears.
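For what it's worth, .htaccess can do more than password protection: with Apache's mod_setenvif plus the standard allow/deny directives, you can refuse requests by User-Agent string, which blocks a rogue spider without asking users or Googlebot for a password. A sketch, using made-up bot names (the real patterns would come from your logs):

```apache
# Tag any request whose User-Agent matches a known-rogue crawler.
# "RogueBot" and "FastSucker" are placeholder names, not real bots.
SetEnvIfNoCase User-Agent "RogueBot"   bad_bot
SetEnvIfNoCase User-Agent "FastSucker" bad_bot

# Everyone else -- Googlebot and ordinary browsers -- gets through.
Order Allow,Deny
Allow from all
Deny from env=bad_bot
```

The obvious weakness is the one noted below: a crawler can lie about its User-Agent, so this only stops the honest-but-greedy ones.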
#5
"Mike"
Aug 2002
2020₁₆ Posts
For the record, the issue isn't even bandwidth. We can handle a lot more bandwidth than we currently use.
The problem is that when a rogue search engine comes in, it sometimes tries to index the whole site in one swoop, and the database server gets overloaded. Google indexes very slowly and uses very few resources, and Google gives us a control panel to adjust the spider's activity.
We wish there were a way to coordinate with Google to index at certain times. Then we could set up a cron job to swap in a Google-friendly robots.txt at certain times of the day, and swap it back after a short period. We're sure there is some sort of solution out there.
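The swap idea itself is just two crontab entries. A sketch, with made-up paths and times (and with the caveat that Google makes no promise to crawl inside any particular window):

```
# 02:00 -- open the indexing window: install the Google-friendly file.
0 2 * * * cp /var/www/robots-google.txt /var/www/html/robots.txt

# 06:00 -- close the window: restore the block-everything file.
0 6 * * * cp /var/www/robots-closed.txt /var/www/html/robots.txt
```

One wrinkle: crawlers cache robots.txt, so a spider that fetched the friendly version at 02:00 may keep crawling on those rules for a while after 06:00.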
#6
Bamboozled!
"𒉺𒌌𒇷𒆷𒀭"
May 2003
Down not across
10,753 Posts
Quote:
If so, block them at the firewall, letting Google and users through unscathed.

Paul
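Blocking at the firewall might look like this on a Linux box with iptables (the netblock below is the reserved documentation range, purely illustrative; the real one would come from the rogue crawler's addresses in the logs):

```
# Drop web traffic from a rogue crawler's netblock before it ever
# reaches Apache or the database. Address range is illustrative only.
iptables -A INPUT -s 203.0.113.0/24 -p tcp --dport 80 -j DROP
```

Unlike robots.txt or User-Agent matching, this can't be dodged by lying in the request headers, though a crawler that hops between netblocks turns it into a game of whack-a-mole.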
#7
"Mike"
Aug 2002
20040₈ Posts
Quote:
(It is possible to spoof this data.) We have all the logs going back a long time, but it is a mountain of data.
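Boiling that mountain down is mostly a one-liner: assuming the logs are in Apache's "combined" format, the sixth double-quote-delimited field is the User-Agent, so a quick tally shows which crawlers are hammering the site. The sample log below is fabricated for illustration; in practice you would point awk at the real access log.

```shell
# Build a tiny sample access log in Apache "combined" format
# (real data would come from the server's own logs).
printf '%s\n' \
  '1.2.3.4 - - [01/Jan/2004:00:00:00 +0000] "GET / HTTP/1.0" 200 123 "-" "Googlebot/2.1"' \
  '5.6.7.8 - - [01/Jan/2004:00:00:01 +0000] "GET / HTTP/1.0" 200 123 "-" "RogueBot/0.1"' \
  '5.6.7.8 - - [01/Jan/2004:00:00:02 +0000] "GET /a HTTP/1.0" 200 123 "-" "RogueBot/0.1"' \
  > access.log

# Splitting each line on double quotes puts the User-Agent in field 6;
# count occurrences and list the busiest crawlers first.
awk -F'"' '{print $6}' access.log | sort | uniq -c | sort -rn | head -20
```

Since the User-Agent can be spoofed, as noted, the counts identify the greedy clients rather than prove who they really are; cross-checking the source IPs against DNS is the usual follow-up.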
Similar Threads

| Thread | Thread Starter | Forum | Replies | Last Post |
|---|---|---|---|---|
| Anyone want a Google+ invite? | chalsall | Lounge | 2 | 2011-07-29 19:31 |
| mprime 24.14 blocked indefinitely | Aillas | Software | 29 | 2005-11-23 17:27 |
| Google Unto Others... | ewmayer | Soap Box | 1 | 2005-08-09 14:13 |
| Were on google | moo | Lounge | 11 | 2005-01-28 14:47 |
| Google Ads... | Xyzzy | Lounge | 61 | 2004-12-25 02:20 |