PHP Browser-Based Website Crawler

Posted by Nessa | Posted on May 7, 2007

I figured out a way to create a PHP website crawler that runs in a web browser instead of from the command line. You can use it to harvest links from a web page for use in a database or search engine, or to see how easily a spider or bot can crawl your site.

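<html>
<head><title>PHP Website Crawler</title></head>
<body>
<font face="verdana" color="#66ccff">
<form id="crawl" method="post" action="">

<label>URL:
<input name="url" type="text" id="url" value="<?php echo htmlspecialchars(isset($_POST['url']) ? $_POST['url'] : 'http://website.com'); ?>" size="70" maxlength="255" />
</label>
<br />
<br />
<label>
<input type="submit" name="Submit" value="Crawl!" />
</label>
<br />
</form>
<?php
if (isset($_POST['url'])) {
    $url = $_POST['url'];
    // Accept only http(s) URLs so the form cannot be used to read local files.
    $f = preg_match('#^https?://#i', $url) ? @fopen($url, "r") : false;
    if ($f === false) {
        print "Could not open that URL.<br>";
    } else {
        // Read the page one line at a time; links split across lines are not matched.
        while (($buf = fgets($f, 4096)) !== false) {
            preg_match_all("/<\s*a\s+[^>]*href\s*=\s*[\"']?([^\"' >]+)[\"' >]/isU", $buf, $words);
            // $words[0] holds the full matches; start at 1 to print only the captured href values.
            for ($i = 1; isset($words[$i]); $i++) {
                for ($j = 0; isset($words[$i][$j]); $j++) {
                    // Escape the output; case is preserved because URL paths are case-sensitive.
                    $cur_word = htmlspecialchars($words[$i][$j]);
                    print "$cur_word<br>";
                }
            }
        }
        fclose($f);
    }
}
?>
</font>
</body>
</html>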
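
The regular expression in the script works on one line at a time, so it can miss links whose tag is split across lines or written in an unusual way. As a rough alternative sketch (not the method used in the post), the links can be pulled out with PHP's bundled DOM extension instead. The function name `extract_links` and the sample markup are made up for illustration, and the sketch assumes the DOM extension is enabled and the whole page has already been read into a string.

```php
<?php
// Hypothetical helper (not part of the original post): returns the href
// value of every <a> element found in an HTML string.
function extract_links($html)
{
    $links = array();
    if (trim($html) === '') {
        return $links; // loadHTML() rejects empty input, so stop early
    }

    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them;
    // real-world pages are rarely valid HTML.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('a') as $anchor) {
        // Skip named anchors and other <a> tags without an href.
        if ($anchor->hasAttribute('href')) {
            $links[] = $anchor->getAttribute('href');
        }
    }
    return $links;
}

// Sample markup, invented for this example.
$sample = '<p><a href="/About">About</a> '
        . '<a class="x" href=\'http://example.com/?q=1\'>Example</a> '
        . '<a name="top">no href</a></p>';

foreach (extract_links($sample) as $link) {
    echo htmlspecialchars($link), "<br>\n";
}
```

In the crawler, the string passed to `extract_links` would be the full page contents rather than a single line, so this approach trades the line-by-line loop for one read of the whole document.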

Comments (6)

Thanks.

Hey it works great, except that you must change these:

To these:

Otherwise it won’t run.
I’m trying to convert it now to extract emails…

WordPress converts quotes to backticks…

Hi,
Nice work, keep it up.
thanks

the code gives notices, just change two lines:
for( $i = 0; isset($words[$i]); $i++ ) {
for( $j = 0; isset($words[$i][$j]); $j++ ) {

and it's perfect :)

The code is great and it works. Can you give me permission to use it for my thesis paper? You can answer me by email. Thanks.
