How to implement a random time delay in perl?
May 6th, 2007
Hi folks –
I’m currently designing a crawling script – aka a scraper, robot, whatever you want to call it. It’s Perl-based, running on Ubuntu Linux. Don’t worry folks – I haven’t headed to the dark side – this is for research purposes only.
My aim is to crawl a large forum sequentially, and essentially save a copy of every page I find in a MySQL database.
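For what it’s worth, the save step is nothing fancy – just a DBI insert, roughly like this (the connection details and the pages(url, html) table are made up for illustration):

use strict;
use warnings;
use DBI;

# Illustrative connection details - adjust database, host and credentials.
my $dbh = DBI->connect('DBI:mysql:database=crawler;host=localhost',
                       'user', 'password', { RaiseError => 1 });

# Store one row per page crawled.
sub save_page {
    my ($url, $html) = @_;
    my $sth = $dbh->prepare('INSERT INTO pages (url, html) VALUES (?, ?)');
    $sth->execute($url, $html);
}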
The script was going fine for about half an hour, and then it stopped – hit with a CAPTCHA and a message like “We have detected you are an automated script; if you aren’t, please enter this number to continue”.
Now, I’m thinking the target server may be seeing a strong, steady load from my IP, and that’s what is causing the problem.
Logically, I’m hoping that if I add a random delay (10-30 seconds) between page fetches, I might get around this issue.
Problem is, how do I implement this in Perl?
I’m thinking something along these lines:
use strict;
use warnings;

# Busy-wait until 10 seconds have passed, then crawl the next page.
my $oldtime = time + 10;
while (1) {
    if (time > $oldtime) {
        sub_to_call();        # fetch and save the next page
        $oldtime = time + 10;
    }
}
But I’m damned if I can remember how to make a random number in Perl. If anyone has any ideas, please offer them up! If not, I’m gonna have to send my requests through Tor, which is going to be a headache.
Cheers,
Matt
Entry Filed under: 8. Linux Tips
6 Comments
1. Chris Hunt | May 6th, 2007 at 2:01 am
Top hit when googling for “perl random number” is http://www.perlmeme.org/howtos/perlfunc/rand_function.html , which seems to pick the bones out of it pretty well.
To wait for 10-30 seconds, do this:
sleep(int(rand(21)) + 10);   # int(rand(21)) gives 0..20, so the pause is 10-30 seconds
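In context, a minimal sketch of the crawl loop might look like this – fetch_and_save() and @page_urls are just placeholders for whatever your script already does:

use strict;
use warnings;

# Placeholders - swap in your real crawl logic and URL list.
sub fetch_and_save { my ($url) = @_; print "fetching $url\n" }

my @page_urls = ('http://example.com/page1', 'http://example.com/page2');

foreach my $url (@page_urls) {
    fetch_and_save($url);         # grab the page and write it to MySQL
    sleep(int(rand(21)) + 10);    # random 10-30 second pause between pages
}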
2. DuckMan | May 6th, 2007 at 2:15 am
Bingo!
Thanks Chris, that worked… for about another 20 pages (out of about a gazillion).
But now I have another problem.
It seems the site in question has figured out that my Internet Explorer user agent is behaving well and is OK, but the Mozilla user agent I’m using for the crawling (from the same IP) is naughty and should no longer be allowed.
Oh my gosh – this is a challenging problem!!
I’ll try round-robin user agents… dear oh me. Otherwise, does anybody know any Perl modules that allow easy interfacing with Tor? That would have to be the ultimate solution.
M
3. DuckMan | May 6th, 2007 at 2:17 am
Also, while we’re asking mates for help – does anybody know a good WordPress plugin for displaying code extracts? The default is UGLY UGLY UGLY 😛
4. JohnMu | May 6th, 2007 at 3:35 am
http://blog.igeek.info/wp-plugins/igsyntax-hiliter/ ?
5. DuckMan | May 6th, 2007 at 3:39 am
Thanks John!
That looks like a good solution.
CHRIS: It looks like your solution, in concert with ‘hopping’ between user agents, has this particular site sussed.
Thanks for your help!
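For anyone following along, the hopping just means picking a random agent string per request – roughly like this (the agent strings and the fetch() wrapper are only examples):

use strict;
use warnings;
use LWP::UserAgent;

# Example agent strings only - use whatever mix suits your testing.
my @agents = (
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
    'Mozilla/5.0 (X11; U; Linux i686; rv:1.8.1) Gecko/20061010 Firefox/2.0',
);

sub fetch {
    my ($url) = @_;
    my $ua = LWP::UserAgent->new;
    $ua->agent($agents[rand @agents]);   # random identity per request
    return $ua->get($url);               # returns an HTTP::Response
}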
But I’ve gotta say – I’m now fascinated, rather than just irritated, by how these scrapers work.
You’ll see I wrote about the little beggars over here – http://www.utheguru.com/backlink-bad-neighbourhood-penalty
M
6. NetNinja | December 4th, 2008 at 2:15 am
You can use Perl to route traffic through the Tor network. Tor builds a new circuit every ten minutes or so, which gives you a fresh exit IP.
You need to use LWP::UserAgent.
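A minimal sketch – this assumes Tor is running locally with its SOCKS proxy on the default 127.0.0.1:9050, and that the LWP::Protocol::socks module from CPAN is installed:

use strict;
use warnings;
use LWP::UserAgent;
use LWP::Protocol::socks;   # teaches LWP the socks:// proxy scheme

my $ua = LWP::UserAgent->new;
# Point HTTP and HTTPS traffic at Tor's local SOCKS proxy.
$ua->proxy([qw(http https)], 'socks://127.0.0.1:9050');

my $response = $ua->get('http://example.com/');
print $response->status_line, "\n";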