Harvest the web

For the last few weeks I’ve been trying to find ways to interact more easily with the Steam Community data that is exposed for all Groups and Users with public profiles. I was frustrated that Valve have not publicised an official API for this data, and that the unofficial efforts failed to meet the scope I was looking for, not to mention being badly broken by changes to the HTML of the target website.

My initial thought was to follow a model similar to this new project. But this approach leaves a number of common scraping problems unresolved:

  1. No caching. Each time data is required, the code will request the source HTML from the target URL
  2. Linear performance. Each time data is required, the code must process the HTML into API objects
  3. Relies on well-formed XML. If PHP’s SimpleXML extension receives tag soup, the solution will fail
  4. Complex code to maintain. When the target website changes the structure of its HTML, the majority of the API code needs a complete rewrite
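To make the third problem concrete, here is a minimal sketch in Python (not PHP; Python's standard-library XML parser stands in for SimpleXML here, and the markup strings and the parses_as_xml helper are made up for illustration). A strict XML parser accepts the well-formed list but rejects the same list written the way real-world HTML often is:

```python
import xml.etree.ElementTree as ET

# Illustrative markup only: the second string omits the closing </li> tags,
# which browsers tolerate in HTML but which is not well-formed XML.
WELL_FORMED = "<ul><li>one</li><li>two</li></ul>"
TAG_SOUP = "<ul><li>one<li>two</ul>"


def parses_as_xml(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


print(parses_as_xml(WELL_FORMED))
print(parses_as_xml(TAG_SOUP))
```

This is why a scraper built directly on an XML parser needs a tidying step in front of it.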

Enter the Reaper

To address these issues, I have been developing Reaper, currently a PHP implementation that doesn’t require any extensions or external libraries. Reaper attempts to condense the common tasks of scraping into small blocks of efficient code and to cache the results transparently for best performance:

  1. Reaper requests the URL (via YQL). The returned HTML is tidied into well-formed XML and cached
  2. Reaper accepts your data definition array, which maps data labels to XPath queries, regular expressions and/or callback functions that scrape the relevant data
  3. Reaper caches the resulting data object and returns it to you
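The three steps above can be sketched roughly as follows. This is not Reaper's code or its API: it is a small Python illustration of the same idea, and every name in it (reap, extract, fake_fetch, the sample markup and the two in-memory caches) is an assumption made up for the example. The fetch function is passed in, so the sketch assumes it already returns tidied, well-formed XML and never touches the network.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical in-memory caches standing in for a persistent cache.
_document_cache = {}
_result_cache = {}


def extract(root, source, rule):
    """Apply one rule: a callback, a compiled regex, or an XPath string."""
    if callable(rule):
        return rule(root)
    if isinstance(rule, re.Pattern):
        match = rule.search(source)
        if match is None:
            return None
        return match.group(1) if match.groups() else match.group(0)
    node = root.find(rule)  # ElementTree supports a limited XPath subset
    return None if node is None else (node.text or "").strip()


def reap(url, definition, fetch):
    """Fetch (or reuse) the document, scrape it, and cache the result."""
    # Simplification: results are keyed by URL plus the sorted labels.
    result_key = (url, tuple(sorted(definition)))
    if result_key in _result_cache:
        return _result_cache[result_key]
    if url not in _document_cache:
        _document_cache[url] = fetch(url)  # assumed to return well-formed XML
    source = _document_cache[url]
    root = ET.fromstring(source)
    data = {label: extract(root, source, rule)
            for label, rule in definition.items()}
    _result_cache[result_key] = data
    return data


# Usage with made-up markup and a fake fetcher that records its calls.
SAMPLE = ("<html><body><h1 id='name'>Example Group</h1>"
          "<p>Members: 42</p></body></html>")
calls = []


def fake_fetch(url):
    calls.append(url)
    return SAMPLE


definition = {
    "name": ".//h1[@id='name']",                           # XPath query
    "members": re.compile(r"Members: (\d+)"),              # regular expression
    "paragraphs": lambda root: len(root.findall(".//p")),  # callback
}

first = reap("http://example.com/group", definition, fake_fetch)
second = reap("http://example.com/group", definition, fake_fetch)
print(first)
print(len(calls))
```

The point of the design is that when the target site changes its HTML, only the definition needs updating, and the second call for the same URL is answered from the cache without fetching or parsing again.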

There’s more work to do to improve error-handling and documentation, but so far I’m pretty pleased with the results.

Meanwhile, I’ve stumbled onto Steam Condenser, so I may not need to roll my own Steam Community API after all 🙂

I’m keen to hear suggestions and feedback, so let me know what you think as a comment, using the contact form or on Twitter.
