Making a simple search engine
We've now looked at fopen() and fsockopen(), both of which are great for
reading in content from websites. However, thanks to the way streams work in
PHP, you can read remote data with a huge selection of functions - even the
relatively lowly file_get_contents(). To show off this functionality, I wrote
a very simple search engine that spiders websites by pulling out hyperlinks
and inserting each page's contents into a MySQL table. The code is very simple
and very naive - it's here to demonstrate a point, not to be a perfect search
engine, so please don't base your own efforts on it!
<?php
$urls = array("http://www.slashdot.org");
$parsed = array();
$sitesvisited = 0;

// failed queries should return false rather than throw exceptions
mysqli_report(MYSQLI_REPORT_OFF);
$db = mysqli_connect("localhost", "phpuser", "alm65z", "phpdb");
if (!$db) die("Could not connect to MySQL\n");

mysqli_query($db, "DROP TABLE IF EXISTS simplesearch;");
mysqli_query($db, "CREATE TABLE simplesearch (URL CHAR(255), Contents TEXT);");
mysqli_query($db, "ALTER TABLE simplesearch ADD FULLTEXT(Contents);");

function parse_site() {
    global $db, $urls, $parsed, $sitesvisited;

    $newsite = array_shift($urls);
    // add this site to the list we've visited already, so that it
    // isn't queued again even if fetching it fails
    $parsed[] = $newsite;
    echo "\n Now parsing $newsite...\n";

    // the @ is because not all URLs are valid, and we don't want
    // lots of errors being printed out
    $ourtext = @file_get_contents($newsite);
    if (!$ourtext) return;

    // escape copies for the query; the unescaped originals are used below
    $safesite = mysqli_real_escape_string($db, $newsite);
    $safetext = mysqli_real_escape_string($db, $ourtext);
    mysqli_query($db, "INSERT INTO simplesearch VALUES ('$safesite', '$safetext');");

    // this site has been successfully indexed; increment the counter
    ++$sitesvisited;

    // this extracts all hyperlinks in the document; the complete
    // matches are in $matches[0]
    preg_match_all("/http:\/\/[A-Z0-9_\-\.\/\?\#\=\&]*/i", $ourtext, $matches);
    $matches = $matches[0];
    $nummatches = count($matches);

    if ($nummatches) {
        echo "Got $nummatches from $newsite\n";

        foreach ($matches as $match) {
            // we want to ignore all these strings
            if (stripos($match, ".exe") !== false) continue;
            if (stripos($match, ".zip") !== false) continue;
            if (stripos($match, ".rar") !== false) continue;
            if (stripos($match, ".wmv") !== false) continue;
            if (stripos($match, ".wav") !== false) continue;
            if (stripos($match, ".mp3") !== false) continue;
            if (stripos($match, ".sit") !== false) continue;
            if (stripos($match, ".mov") !== false) continue;
            if (stripos($match, ".avi") !== false) continue;
            if (stripos($match, ".msi") !== false) continue;
            if (stripos($match, ".rpm") !== false) continue;
            if (stripos($match, ".rm") !== false) continue;
            if (stripos($match, ".ram") !== false) continue;
            if (stripos($match, ".asf") !== false) continue;
            if (stripos($match, ".mpg") !== false) continue;
            if (stripos($match, ".mpeg") !== false) continue;
            if (stripos($match, ".tar") !== false) continue;
            if (stripos($match, ".tgz") !== false) continue;
            if (stripos($match, ".bz2") !== false) continue;
            if (stripos($match, ".deb") !== false) continue;
            if (stripos($match, ".pdf") !== false) continue;
            if (stripos($match, ".jpg") !== false) continue;
            if (stripos($match, ".jpeg") !== false) continue;
            if (stripos($match, ".gif") !== false) continue;
            if (stripos($match, ".tif") !== false) continue;
            if (stripos($match, ".png") !== false) continue;
            if (stripos($match, ".swf") !== false) continue;
            if (stripos($match, ".svg") !== false) continue;
            if (stripos($match, ".bmp") !== false) continue;
            if (stripos($match, ".dtd") !== false) continue;
            if (stripos($match, ".xml") !== false) continue;
            if (stripos($match, ".js") !== false) continue;
            if (stripos($match, ".vbs") !== false) continue;
            if (stripos($match, ".css") !== false) continue;
            if (stripos($match, ".ico") !== false) continue;
            if (stripos($match, ".rss") !== false) continue;
            if (stripos($match, "w3.org") !== false) continue;

            // yes, these next two are very vague, but they do cut out
            // the vast majority of advertising links. Like I said,
            // this indexer is far from perfect!
            if (stripos($match, "ads.") !== false) continue;
            if (stripos($match, "ad.") !== false) continue;
            if (stripos($match, "doubleclick") !== false) continue;

            // this URL looks safe
            if (!in_array($match, $parsed)) { // we haven't already parsed this URL...
                if (!in_array($match, $urls)) { // we don't already plan to parse this URL...
                    array_push($urls, $match);
                    echo "Adding $match...\n";
                }
            }
        }
    } else {
        echo "Got no matches from $newsite\n";
    }
}

while ($sitesvisited < 500 && count($urls) != 0) {
    parse_site();
    // this stops us from overloading web servers
    sleep(5);
}
?>
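The long run of stripos() checks in the spider can also be written as one loop over a list of unwanted strings. What follows is only a minimal sketch of that idea, not part of the original script: the function name should_skip_url(), the variable $unwanted, and the example.com URLs are all made up for illustration, and the list is shortened.

```php
<?php
// Hypothetical helper: returns true if $url contains any unwanted string.
// Matching is case-insensitive and purely substring-based, just like
// the individual stripos() checks in the spider.
function should_skip_url(string $url, array $unwanted): bool
{
    foreach ($unwanted as $needle) {
        if (stripos($url, $needle) !== false) {
            return true;
        }
    }
    return false;
}

// A shortened version of the spider's list, for illustration only.
$unwanted = array(".exe", ".zip", ".mp3", ".jpg", ".css", "w3.org",
                  "ads.", "ad.", "doubleclick");

var_dump(should_skip_url("http://example.com/files/setup.EXE", $unwanted));
var_dump(should_skip_url("http://example.com/story.html", $unwanted));
var_dump(should_skip_url("http://download.example.com/", $unwanted));
```

The first and third calls return true and the second returns false. The third is a false positive: "download." happens to contain "ad.", which is exactly the kind of vagueness the comments in the spider warn about.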
It's commented throughout, so it shouldn't be a problem to understand. The
script is hard-coded to stop after indexing 500 URLs, but even that takes a
while: it is single-threaded, and the five-second pause after each page adds
up to more than 40 minutes on its own. Once you have run the script, you'll
need a way to search what it has indexed - here's the corresponding search page:
<?php
if (isset($_POST['criteria'])) {
    $db = mysqli_connect("localhost", "phpuser", "alm65z", "phpdb");
    if (!$db) die("Could not connect to MySQL");

    $criteria = mysqli_real_escape_string($db, $_POST['criteria']);
    $result = mysqli_query($db, "SELECT URL FROM simplesearch WHERE MATCH(Contents) AGAINST ('$criteria') ORDER BY URL ASC;");

    if ($result && mysqli_num_rows($result)) {
        echo "Search found the following matches...<br /><br />";
        echo "<ul>";
        while ($r = mysqli_fetch_assoc($result)) {
            // escape the URL before printing it into the page
            $url = htmlspecialchars($r['URL'], ENT_QUOTES);
            echo "<li><a href=\"$url\">$url</a></li>";
        }
        echo "</ul>";
    } else {
        // user input must be escaped for HTML before it is echoed back
        $shown = htmlspecialchars($_POST['criteria'], ENT_QUOTES);
        echo "No matches found for the criteria '$shown'.<br /><br />";
    }
}
?>
<form method="post">
Search for: <input type="text" name="criteria" />
<input type="submit" value="Go" />
</form>
Anyway, that was just a short example to show how easy network programming is
in PHP. Like I said, as a search engine it's about as simplistic as they come,
and there are numerous problems in there. At the very least, a good search
engine should cache the URLs of media items like MP3s and AVI files, instead
of ignoring them as this script does. Furthermore, 500 URLs take up about 16MB
of disk space, which is an enormous amount for so little payback. There are
almost certainly faster regular expressions for link matching, too. So, if you
really want to make your own search engine, look somewhere else!
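To make the point about media items a little more concrete, here is a rough sketch of how links could be sorted into "remember but don't download" and "crawl as usual". It is an assumption about one possible approach, not something the script above does: the function classify_link(), the arrays $medialinks and $crawlqueue, the short extension list, and the example.com URLs are all hypothetical.

```php
<?php
// Hypothetical helper: decides what a spider should do with a link.
// Returns "media" for links worth recording but not downloading,
// and "crawl" for everything else.
function classify_link(string $url): string
{
    // Illustrative list only; a real one would be much longer.
    $media = array("mp3", "wav", "avi", "mov", "mpg", "mpeg", "wmv");

    // Look only at the path, so host names and query strings don't interfere.
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return "crawl";
    }

    $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    return in_array($extension, $media, true) ? "media" : "crawl";
}

$medialinks = array();
$crawlqueue = array();

$found = array(
    "http://example.com/music/track.MP3",
    "http://example.com/news.php?id=7",
    "http://example.com/clips/intro.avi?hq=1",
);

foreach ($found as $link) {
    if (classify_link($link) === "media") {
        $medialinks[] = $link;   // remember it, but never fetch it
    } else {
        $crawlqueue[] = $link;   // fetch and index as before
    }
}

print_r($medialinks);
print_r($crawlqueue);
```

Here the MP3 and AVI links end up in $medialinks and only the news.php link is queued for crawling. Storing the media list (in a second table, for instance) is left out to keep the sketch short.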