#1 |
Bemusing Prompter
"Danny"
Dec 2002
California
2×29×43 Posts |
A while ago, our robots.txt file blocked everything except for Google, but it is now blocking Google again as well.
Was there a reason for this change? I feel that this is unfortunate, since we cannot get much publicity this way. Google does cache almost every page it crawls, but the cached pages do not last forever. The cached pages from this forum will probably be gone the next time Google tries to crawl us.
#2 |
Aug 2002
2×7×13×47 Posts |
When we give Google permission to index the forum, a bunch of other search engines come in and put a major strain on the server.
We've checked the syntax for letting only Google in multiple times.
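For checking an only-Google rule set offline, here is a minimal sketch using Python's standard `urllib.robotparser`. The `ROBOTS_TXT` contents, the `allowed` helper, and the example URLs are assumptions for illustration, not the forum's actual file. It only shows how a compliant parser reads the rules; it says nothing about crawlers that ignore robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: Googlebot may fetch everything, everyone else nothing.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a compliant crawler with this agent token may fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

An empty `Disallow:` means nothing is off limits for that agent, while `Disallow: /` under `User-agent: *` shuts out everyone else. Calling `allowed("Googlebot", "http://example.com/index.php")` and `allowed("SomeOtherBot", "http://example.com/index.php")` is a quick way to compare the two cases.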
#3 |
∂²ω=0
Sep 2002
República de California
26753₈ Posts |
Hey, Mike: in the following how-to-set-up-robots.txt page:
http://www.thesitewizard.com/archive/robotstxt.shtml
I note this quote: "Listing something in your robots.txt is no guarantee that it will be excluded. If you really need to protect something, you should use a .htaccess file (if you are running your site on an Apache server)."
Is that only relevant to directory names, or could it also apply to search-engine-specific inclusion/exclusion stuff? [Of course, if you don't use Apache it's moot.]
#4 |
Aug 2002
2·7·13·47 Posts |
Of course we run Apache.
The .htaccess stuff is, from what we understand from your link, just password-protecting directories. (There are other .htaccess functions, like redirecting stuff and blocking IP addresses, but we don't think those apply to this discussion.) For example, we use a .htaccess/.htpasswd deal on our administrative pages, so those never get indexed, but we can't think of a way to let Google through that kind of system because it requires a password. (Plus, the users would need the password too!) If there is a way to block search engines this way, we are, like the Ferengi, all ears.
#5 |
Aug 2002
216A₁₆ Posts |
For the record, the issue isn't even bandwidth. We can handle a lot more bandwidth than we currently use.
The problem is that when a rogue search engine comes in, it sometimes tries to index the whole site in one swoop, so the database server gets overloaded. Google indexes very slowly and uses very few resources, and Google gives us a control panel to control the spider's activity. We wish there were a way to coordinate with Google to index at certain times; then we could make a cron job to swap the robots.txt file with a Google-friendly one at certain times of the day and swap it back after a short period. We're sure there is some sort of solution out there.
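The swap itself could look something like the sketch below, which a cron job would call once to install the Google-friendly file and again later to restore the restrictive one. The function name `install_robots` and the file names in the usage note are hypothetical, and the schedule would still be a guess, since nothing here coordinates timing with Google.

```python
import os
import shutil
import tempfile
from pathlib import Path


def install_robots(source: Path, live: Path) -> None:
    """Replace the live robots.txt with a copy of source in one atomic step."""
    # Write the copy next to the live file so the final rename stays on the
    # same filesystem; a crawler then never sees a half-written robots.txt.
    fd, tmp_name = tempfile.mkstemp(dir=live.parent, prefix=".robots-")
    os.close(fd)
    try:
        shutil.copyfile(source, tmp_name)
        # mkstemp creates the file owner-only; make it readable by the web server.
        os.chmod(tmp_name, 0o644)
        os.replace(tmp_name, live)
    finally:
        # Clean up only if the rename did not happen.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
```

For example, one cron entry might call `install_robots(Path("robots.google.txt"), Path("robots.txt"))` at the start of the window and another might call it with a `robots.closed.txt` at the end. Crawlers may cache robots.txt for a while, so a very short window could easily be missed.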
#6 | |
Bamboozled!
"𒉺𒌌𒇷𒆷𒀭"
May 2003
Down not across
61·191 Posts |
Quote:
If so, block them at the firewall, letting Google and users through unscathed.

Paul
#7 | |
Aug 2002
2·7·13·47 Posts |
Quote:
(It is possible to spoof this data.) We have all the logs going back a long time, but it is a mountain of data.
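One way to start on a mountain of logs is to tally requests per User-Agent string. This is a rough sketch under two assumptions: that the logs are in Apache's combined format, where the User-Agent is the last quoted field on each line, and that "this data" above refers to that field. The helper name `count_user_agents` is made up for illustration, and since the field can be spoofed, the totals are only a first hint.

```python
import re
from collections import Counter
from typing import Iterable

# In the combined log format the User-Agent is the final quoted field.
# Agents containing escaped quotes are not handled by this simple pattern.
_LAST_QUOTED = re.compile(r'"([^"]*)"\s*$')


def count_user_agents(lines: Iterable[str]) -> Counter:
    """Count log lines per User-Agent string, skipping lines that do not match."""
    counts: Counter = Counter()
    for line in lines:
        match = _LAST_QUOTED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open log file to `count_user_agents` and then calling `.most_common(10)` on the result would list the ten busiest agents, which is the kind of shortlist a firewall rule could be built from.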