Other than always obeying the robots.txt "Disallow:" and non-standard "Crawl-delay:" directives:
if a site does not specify an explicit crawl delay, what should the default value be set to?
Solution

The algorithm we use is:

// If we are blocked by robots.txt
// make sure it is obeyed.
// Our bot's user-agent string contains a link to an HTML page explaining this,
// and also an email address they can write to so that we never even consider their domain in the future.

// If we receive more than 5 consecutive responses with an HTTP response code of 500+ (or timeouts)
// then we assume the domain is either under heavy load and does not need us adding to it,
// or the URLs we are crawling are completely wrong and causing problems.
// Either way we suspend crawling from this domain for 4 hours.

// There is a non-standard parameter in robots.txt that defines a min crawl delay.
// If it exists then obey it.
//
// See: http://www.searchtools.com/robots/robots-txt-elements.html
double politenessFromRobotsTxt = getRobotPolitness();

// Work-size politeness.
// Large popular domains are designed to handle load, so we can use a
// smaller delay on these sites than for smaller domains (thus smaller domains hosted by
// mom and pops on the family PC under the desk in the office are crawled slowly).
//
// But the max delay here is 5 seconds:
//
// domainSize => range 0 -> 10
//
double workSizeTime = std::min(exp(2.52166863221 + -0.530185027289 * log(domainSize)), 5.0);
//
// You can find out how important we think your site is here:
// http://www.opensiteexplorer.org
// Look at the Domain Authority and divide by 10.
// Note: this is not exactly the number we use, but the two numbers are highly correlated,
// so it will usually give you a fair indication.

// Take into account the response time of the last request.
// If the server is under heavy load and taking a long time to respond,
// then we slow down the requests. Note: time-outs are handled above.
double responseTime = pow(0.203137637588 + 0.724386103344 * lastResponseTime, 2);

// Use the slower of the calculated times.
double result = std::max(workSizeTime, responseTime);

// Never faster than the crawl-delay directive.
result = std::max(result, politenessFromRobotsTxt);

// Set a minimum delay
// so we never hit a site more than every 10th of a second.
result = std::max(result, 0.1);

// The maximum delay we have is every 2 minutes.
result = std::min(result, 120.0);
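The answer calls getRobotPolitness() but does not show it. Below is a minimal sketch of what such a Crawl-delay extractor might look like, assuming the robots.txt group that applies to our user-agent has already been fetched and isolated into a string; the robotsGroup parameter and the 0.0 fallback are assumptions for illustration, not the author's actual implementation.

// Minimal sketch of a Crawl-delay extractor (assumed implementation, not the
// answer's real getRobotPolitness()).
// Assumes the relevant robots.txt group for our user-agent has already been
// isolated into `robotsGroup`; returns 0.0 when no directive is present so the
// other politeness factors decide the delay.
#include <algorithm>
#include <cctype>
#include <sstream>
#include <string>

double getRobotPolitness(const std::string& robotsGroup)
{
    std::istringstream input(robotsGroup);
    std::string line;
    while (std::getline(input, line)) {
        // Strip trailing comments.
        std::string::size_type hash = line.find('#');
        if (hash != std::string::npos) {
            line.erase(hash);
        }
        // Lower-case a copy for a case-insensitive directive match.
        std::string lower(line);
        std::transform(lower.begin(), lower.end(), lower.begin(),
                       [](unsigned char c) { return std::tolower(c); });

        const std::string directive = "crawl-delay:";
        std::string::size_type pos = lower.find(directive);
        if (pos == std::string::npos) {
            continue;
        }
        try {
            // Value is in seconds; some sites use fractional values.
            return std::stod(line.substr(pos + directive.size()));
        } catch (const std::exception&) {
            // Malformed value: ignore it and keep scanning.
        }
    }
    return 0.0;   // No Crawl-delay directive found.
}

If this returns 0.0 (no directive), the std::max chain above simply falls back to the work-size and response-time delays, clamped to the 0.1-second floor and 120-second ceiling.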