Reaper

Reaper is a PHP class that makes scraping the web easier and more efficient! Scraping can be a useful technique when building mashups or for creating APIs for websites that don’t provide one.*

You can use XPath, Regular Expressions (RegEx) and callback functions to define which data you want to retrieve from a URL, then:

  1. Reaper requests the URL (via YQL). The returned HTML is tidied into well-formed XML and cached
  2. Reaper uses your data definition to scrape the relevant data
  3. Reaper caches the resulting data object and returns it to you

Caching is seamless so you only have to ensure the cache directory is writable and the rest happens transparently.
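
As a rough illustration of the caching idea (this is not Reaper's actual code), a file cache keyed on a hash of the URL might look like the sketch below. The helper names cachePath, cacheSet and cacheGet, the md5 file naming and the use of serialize() are all assumptions made for the example, and the system temp directory stands in for the cache directory.

```php
<?php
// Hypothetical helpers, not part of Reaper: a tiny file cache keyed by URL.

function cachePath($cacheDir, $url) {
	// One file per URL; md5 keeps the file name filesystem-safe.
	return rtrim($cacheDir, '/') . '/' . md5($url) . '.cache';
}

function cacheSet($cacheDir, $url, $data) {
	// serialize() lets strings, arrays and plain objects be stored alike.
	return file_put_contents(cachePath($cacheDir, $url), serialize($data)) !== false;
}

function cacheGet($cacheDir, $url) {
	$file = cachePath($cacheDir, $url);
	// Returns null on a cache miss.
	return is_file($file) ? unserialize(file_get_contents($file)) : null;
}

// Usage: store a scraped value, then read it back.
$dir = sys_get_temp_dir();
cacheSet($dir, 'http://www.example.com/', array('url' => 'http://example.com/another-page.html'));
$cached = cacheGet($dir, 'http://www.example.com/');
echo $cached['url'], "\n";
```

If the directory is not writable, file_put_contents() fails and nothing is stored, which is why the writable cache directory matters.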

Examples

For the sake of these examples, we’ll pretend the HTML for http://www.example.com looks like this:

<html>
<head><title>example.com</title></head>
<body>
<div id="container">
	<p><a href="http://example.com/another-page.html">Another page</a></p>
</div>
</body>
</html>

Basic scrape

Let’s say you want to scrape a link from http://www.example.com. You could call Reaper with a data definition array like this:

$data = $Reaper->harvest (
	'http://www.example.com/',
	array(
		'url' => '//div//a/@href',
	)
);

This simple XPath query will return a data object we can use like this:

echo $data->url;
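
If you want to check what an XPath expression selects before handing it to Reaper, you can try it against the sample HTML with PHP's own DOM extension. This sketch does not use Reaper at all; it only shows which node '//div//a/@href' matches in the example page.

```php
<?php
// Standalone check of the XPath expression using PHP's DOM extension.
$html = '<html>
<head><title>example.com</title></head>
<body>
<div id="container">
	<p><a href="http://example.com/another-page.html">Another page</a></p>
</div>
</body>
</html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// query() returns a DOMNodeList; the first item is the href attribute node.
$nodes = $xpath->query('//div//a/@href');
$url = $nodes->item(0)->nodeValue;
echo $url, "\n"; // http://example.com/another-page.html
```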

Repeated sections

If we expect a number of links, or want to collect data from repeated sections of HTML/XML, we need to alter the data definition array:

$data = $Reaper->harvest (
	'http://www.example.com/',
	array(
		'links' => array(
			'_repeated' => '//div//a',
			'url' => './@href', // xpath relative to '_repeated'
			'text' => './text()',
		),
	)
);

This example would return $data->links as an array. We can loop through the array and use the data like this:

foreach ($data->links as $link) {
	echo '<a href="'.$link->url.'">'.$link->text.'</a>';
}

Of course embedding HTML for output inside PHP strings like this is less than ideal, but you get the idea.
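
To see how XPath expressions relative to a repeated node resolve, here is a standalone sketch using PHP's DOMXPath rather than Reaper. The second link in the sample markup is invented for the example, and the stdClass objects only stand in for whatever Reaper returns.

```php
<?php
// Sample markup; the second link is an assumption added for illustration.
$xml = '<div id="container">
	<p><a href="http://example.com/another-page.html">Another page</a></p>
	<p><a href="http://example.com/third-page.html">Third page</a></p>
</div>';

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

$links = array();
// '//div//a' plays the role of the '_repeated' expression.
foreach ($xpath->query('//div//a') as $a) {
	$link = new stdClass();
	// Passing $a as the context node makes './...' relative to this <a>.
	$link->url = $xpath->query('./@href', $a)->item(0)->nodeValue;
	$link->text = $xpath->query('./text()', $a)->item(0)->nodeValue;
	$links[] = $link;
}

foreach ($links as $link) {
	echo $link->url, ' ', $link->text, "\n";
}
```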

Reaper query

For a more complex example, let’s say you wanted to make sure the link returned uses the full www.example.com domain name. You could call Reaper with a data definition array like this:

$data = $Reaper->harvest (
	'http://www.example.com/',
	array(
		'wwwurl' => '//div//a/@href||http://example.com/(.*)||i||http://www.example.com/$1',
	)
);

This is an example of a Reaper query, which is essentially an XPath query combined with a RegEx pattern, RegEx modifiers and a RegEx replacement in one string, delimited by the double pipe '||'.

The RegEx portions of the Reaper query are each optional. For example, if you were only interested in the domain-relative path of the links and you had no need for modifiers or replacement, you could use a query similar to the following:

$data = $Reaper->harvest (
	'http://www.example.com/',
	array(
		'path' => '//div//a/@href||http://example.com(/.*)',
	)
);

In this case, the path starting from the first slash / after the domain is returned as $data->path.
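
To make the structure of a Reaper query concrete, the sketch below splits a query on '||' and applies the RegEx parts to a value that the XPath part is assumed to have already extracted. It is a guess at the behaviour described here, not Reaper's implementation: the function name applyRegexParts, the '~' delimiter and the rule "return the first capture group if there is one" are assumptions, and Reaper's extra g modifier is not handled.

```php
<?php
// Hypothetical sketch, not Reaper's code: apply the RegEx parts of a query
// to a value that the XPath part has already extracted.
function applyRegexParts($value, $query) {
	// Parts: XPath, pattern, modifiers, replacement (the last three optional).
	$parts = array_pad(explode('||', $query), 4, '');
	list(, $pattern, $modifiers, $replacement) = $parts;
	if ($pattern === '') {
		return $value; // XPath only, nothing to apply
	}
	// '~' is used as the delimiter so slashes in URLs need no escaping.
	$regex = '~' . $pattern . '~' . $modifiers;
	if ($replacement !== '') {
		return preg_replace($regex, $replacement, $value, 1); // first match only
	}
	if (preg_match($regex, $value, $m)) {
		// Assumption: return the first capture group when there is one.
		return isset($m[1]) ? $m[1] : $m[0];
	}
	return null;
}

$href = 'http://example.com/another-page.html';
$wwwurl = applyRegexParts($href, '//div//a/@href||http://example.com/(.*)||i||http://www.example.com/$1');
$path = applyRegexParts($href, '//div//a/@href||http://example.com(/.*)');
echo $wwwurl, "\n"; // http://www.example.com/another-page.html
echo $path, "\n";   // /another-page.html
```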

Modifiers

All of the PHP PCRE modifiers are supported along with:

g
Global. By default (without this modifier) Reaper will return or replace only the first RegEx match. By specifying the g modifier, Reaper will return or replace all matches found in the input string.
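
PHP's preg_* functions do not accept g as a pattern modifier, so a wrapper has to interpret it separately. One possible way to do that (an assumption, not necessarily how Reaper does it) is to strip the g and translate it into the limit argument of preg_replace(). The function name replaceWithModifiers is made up for this sketch.

```php
<?php
// Hypothetical sketch: strip the extra "g" flag before calling PCRE and
// turn it into the $limit argument of preg_replace().
function replaceWithModifiers($pattern, $modifiers, $replacement, $subject) {
	$global = strpos($modifiers, 'g') !== false;
	$pcreModifiers = str_replace('g', '', $modifiers);
	$limit = $global ? -1 : 1; // -1 means "no limit" for preg_replace()
	return preg_replace('~' . $pattern . '~' . $pcreModifiers, $replacement, $subject, $limit);
}

$first = replaceWithModifiers('a', 'i', '*', 'Banana');
$all = replaceWithModifiers('a', 'ig', '*', 'Banana');
echo $first, "\n"; // B*nana
echo $all, "\n";   // B*n*n*
```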

Callback functions

Callback functions can be used in place of Reaper queries using the following syntax:

function callbackFunction ($dom, $textLabel, $hrefLabel) {
	// Use the $dom and the arguments to parse the information you are looking for.
	// The result should be a String or ReaperDataObject so it can be serialized and cached.
	$result = new ReaperDataObject();
	$result->$textLabel = $dom->xpath('//div//a/text()').'';
	$result->$hrefLabel = $dom->xpath('//div//a/@href').'';
	return $result;
}

$data = $Reaper->harvest (
	'http://www.example.com/',
	array(
		'link' => array(
			'_callback',
			'callbackFunction', // can also use: array($callbackObj, 'callBackMethod')
			array('text', 'url'), // arguments to pass to callback function
		),
	)
);

In this example, the data will be available in $data->link->text and $data->link->url.
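
The way the extra arguments reach the callback can be pictured with PHP's call_user_func_array(). The sketch below is only a guess at the dispatch step: runCallback and linkCallback are invented names, the document is assumed to be a SimpleXMLElement passed as the first argument, and stdClass stands in for ReaperDataObject.

```php
<?php
// Hypothetical dispatcher for a '_callback' definition, not Reaper's code.
function runCallback($dom, array $definition) {
	list($marker, $callback, $args) = $definition;
	if ($marker !== '_callback' || !is_callable($callback)) {
		return null;
	}
	// The document goes first, followed by the user-supplied arguments.
	return call_user_func_array($callback, array_merge(array($dom), $args));
}

function linkCallback($dom, $textLabel, $hrefLabel) {
	// SimpleXMLElement::xpath() returns an array of matching elements.
	$links = $dom->xpath('//div//a');
	$result = new stdClass(); // stands in for ReaperDataObject
	$result->$textLabel = (string) $links[0];
	$result->$hrefLabel = (string) $links[0]['href'];
	return $result;
}

$dom = simplexml_load_string(
	'<div id="container"><p><a href="http://example.com/another-page.html">Another page</a></p></div>'
);
$link = runCallback($dom, array('_callback', 'linkCallback', array('text', 'url')));
echo $link->text, ' ', $link->url, "\n";
```

Because is_callable() also accepts array($object, 'method'), the same dispatcher would work for the object-method form of the callback.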

Combinations

All of these techniques can be combined to build up more complex data objects.

Installation

  1. Download the release package
  2. Unzip the package (discard the folder with version number)
  3. Add the folder Reaper to your project’s library directory
  4. Ensure the cache directory is writable by the PHP process (by default, this directory sits inside the Reaper directory)
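
A quick way to verify the last step is to ask PHP itself whether the directory is writable. The helper name cacheDirProblem is made up for this sketch, and the system temp directory is used only as a placeholder; substitute the path to your own cache directory.

```php
<?php
// Hypothetical pre-flight check, not part of Reaper.
function cacheDirProblem($cacheDir) {
	if (!is_dir($cacheDir)) {
		return 'not a directory';
	}
	if (!is_writable($cacheDir)) {
		return 'not writable by the PHP process';
	}
	return null; // no problem found
}

$cacheDir = sys_get_temp_dir(); // placeholder for the real cache directory
$problem = cacheDirProblem($cacheDir);
echo $problem === null ? "cache directory OK\n" : "cache directory: $problem\n";
```

Run the check through the same web server or CLI user that will run Reaper, since permissions can differ between them.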

Use

Once Reaper is installed you can use it by:

  1. Including the class:
    require_once('Reaper/class.Reaper.php');
  2. Instantiating a new Reaper object
    $Reaper = new Reaper();

    Optionally, you can pass a configuration array during instantiation:

    $config = array(
    	'cacheDir' => '/path/to/cache/directory/', // default is Reaper/cache/
    	'cacheXML' => true, // best to leave this to true
    	'cacheDataObjects' => true, // change this to false for development and debugging
    );
    
    $Reaper = new Reaper($config);

You can now call the $Reaper->harvest() method (as per examples) to perform scraping and data display.

More examples and documentation will be available in the future.

Download

Reaper is licensed under the GNU General Public License (GPL). By downloading and/or using it, you agree to abide by the terms of that license and not to use it to harvest email addresses or other personal contact details.

Package

Download the complete Reaper 0.1 package (.zip)

Road map

  • Better error-handling and guidance for debugging Reaper queries.
  • More examples and documentation.

* Terms of use

Please use Reaper responsibly and legally. Some websites prohibit the use of scraping and data mining tools in their Terms and Conditions, Terms of use or Copyright statements. Please review these documents carefully and if in doubt request written permission from the target website’s owner before scraping.

Also, if a website provides a data API, where possible access data using the API rather than by scraping.

Note: This class was not created for harvesting email addresses or other personal contact details (any more than any other tool that works with HTML/XML is). It is a condition of downloading and using this class that it is not used for these purposes.
