mersenneforum.org  

2008-01-29, 05:57   #1
ixfd64
Bemusing Prompter
"Danny"
Dec 2002
California
2×3²×137 Posts
Google blocked again

A while ago, our robots.txt file blocked everything except for Google, but it is now blocking Google again as well.

Was there a reason for this change?

I feel that this is unfortunate, since we cannot get much publicity this way. Google does cache almost every page it crawls, but the cached pages do not last forever. The cached pages from this forum will probably be gone the next time Google tries to crawl us.
2008-01-29, 18:41   #2
Xyzzy
Aug 2002
3³×313 Posts

When we give Google permission to index the forum, a bunch of other search engines come in as well and put a major strain on the server.

We've checked the syntax multiple times for letting only Google in.
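
For readers who want to check such a file themselves, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt text, the bot names, and the URL are illustrative assumptions, not the forum's actual file. Note that robots.txt is only advisory: a crawler that ignores it will not be stopped by correct syntax.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: Googlebot may fetch everything, everyone else nothing.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a well-behaved crawler with this name may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "http://example.org/forum/index.php"  # placeholder URL
    for agent in ("Googlebot", "SomeOtherBot"):
        print(agent, allowed(agent, url))
```

This only tells you what a compliant crawler would conclude from the file; it says nothing about crawlers that never read it.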

2008-01-29, 23:59   #3
ewmayer
2ω=0
Sep 2002
República de California
3²·1,303 Posts

Hey, Mike: in the following how-to-set-up-robots.txt page:

http://www.thesitewizard.com/archive/robotstxt.shtml

I note this quote:

"Listing something in your robots.txt is no guarantee that it will be excluded. If you really need to protect something, you should use a .htaccess file (if you are running your site on an Apache server)."

Is that only relevant to directory names, or could it also be for search-engine-specific in|exclusion stuff? [Of course if you don't use Apache it's moot.]
2008-01-30, 04:13   #4
Xyzzy
Aug 2002
3³·313 Posts

Of course we run Apache.



From what we understand of your link, the .htaccess stuff is just for password-protecting directories.

(There are other .htaccess functions, like redirecting stuff, and blocking IP addresses, but we don't think those apply to this discussion.)

For example, we use a .htaccess/.htpasswd deal on our administrative pages, so those never get indexed, but we can't think of a way to allow Google through that kind of system because it requires a password. (Plus, the users would need the password too!)

If there is a way to block search engines this way we are, like the Ferengi, all ears.
2008-01-30, 04:19   #5
Xyzzy
Aug 2002
3³·313 Posts

For the record, the issue isn't even bandwidth. We can handle a lot more bandwidth than we use currently.

The problem is that when a rogue search engine comes in, it sometimes tries to index the whole site in one swoop, so the database server gets overloaded.

Google indexes very slowly and uses very few resources. Google also gives us a control panel to control the spider's activity.

We wish there were a way to coordinate with Google to index at certain times. Then we could set up a cron job to swap in a Google-friendly robots.txt file at certain times of day and swap it back after a short period.

We're sure there is some sort of solution out there.
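
The swap itself could be sketched roughly as follows. This is only an illustration under stated assumptions: the file names `robots.google.txt` and `robots.closed.txt`, the 02:00 to 04:00 window, and the function names are all hypothetical, and nothing here makes Google crawl at a particular time. Crawlers also tend to cache robots.txt for a while, so a short window might go unnoticed.

```python
import shutil
from datetime import time
from pathlib import Path

# Hypothetical crawl window; pick whatever suits the server's quiet hours.
WINDOW_START = time(2, 0)
WINDOW_END = time(4, 0)


def in_window(now: time, start: time = WINDOW_START, end: time = WINDOW_END) -> bool:
    """True if now falls in [start, end); also handles windows crossing midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def install_robots(docroot: Path, now: time) -> str:
    """Copy the appropriate variant over robots.txt and return the variant's name."""
    # Both variant file names are assumptions for this sketch.
    variant = "robots.google.txt" if in_window(now) else "robots.closed.txt"
    tmp = docroot / "robots.txt.tmp"
    shutil.copyfile(docroot / variant, tmp)
    # Rename into place so readers never see a half-written file.
    tmp.replace(docroot / "robots.txt")
    return variant
```

A cron job could call `install_robots` with the document root and `datetime.now().time()` every few minutes; running it more often than needed is harmless because it always installs the variant that matches the current time.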
2008-01-30, 09:05   #6
xilman
Bamboozled!
"𒉺𒌌𒇷𒆷𒀭"
May 2003
Down not across
10110000011111₂ Posts

Possible idea: do the "rogue" search engines use a consistent set of IP addresses?

If so, block them at the firewall, letting Google and users through unscathed.


Paul
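
If the offending crawlers did turn out to come from a stable set of networks, the matching rule a firewall would apply can be sketched with Python's standard `ipaddress` module. The networks below are reserved documentation ranges used as placeholders, not real crawler addresses, and an actual block would be configured in the firewall rather than in Python.

```python
import ipaddress

# Hypothetical blocklist; these are reserved documentation ranges, not real crawlers.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_blocked(address: str, networks=BLOCKED_NETWORKS) -> bool:
    """Return True if the address falls inside any blocked network."""
    ip = ipaddress.ip_address(address)
    return any(ip in net for net in networks)
```

Blocking whole networks rather than single addresses keeps the rule list short, at the cost of possibly catching unrelated visitors in the same range.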
2008-01-30, 19:27   #7
Xyzzy
Aug 2002
20403₈ Posts

Quote:
Possible idea: do the "rogue" search engines use a consistent set of IP addresses?
Not that we have been able to determine. However, when we do a lookup of the IP address from within the forum it somehow identifies the "user" as a bot. I think it has something to do with the way the "user" identifies itself. For example, every request for a page is logged with the referring page, the browser, the operating system and a pile of other things.

(It is possible to spoof this data.)

We have all the logs going back a long time but it is a mountain of data.
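
One way to cut that mountain down is to tally requests per user-agent string and count how many distinct IP addresses each one used. The sketch below assumes the logs are in Apache's "combined" format, which may not match the server's actual configuration; the function names are made up for illustration. Since the user-agent string can be spoofed, the result is a hint, not proof.

```python
import re
from collections import defaultdict

# Assumes Apache "combined" log format; adjust the pattern for other formats.
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def summarize(lines):
    """Map each user-agent string to a dict of {ip: request_count}."""
    by_agent = defaultdict(lambda: defaultdict(int))
    for line in lines:
        match = LINE_RE.match(line)
        if match:  # silently skip lines that do not fit the assumed format
            by_agent[match["agent"]][match["ip"]] += 1
    return by_agent


def heaviest(by_agent, top=10):
    """Return (agent, total_requests, distinct_ips) tuples, busiest first."""
    rows = [(agent, sum(ips.values()), len(ips)) for agent, ips in by_agent.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)[:top]
```

Passing an open log file to `summarize` processes it line by line, so the whole file never has to fit in memory. An agent with many requests from only a few addresses would be a candidate for an IP-based block.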
All times are UTC.



Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
