
Scraping HTML

Tim L edited this page May 27, 2014 · 46 revisions

::sigh::

What is first

First, a nice article about just using the web as an API.

Others' work:

What we will cover

This page lists some XSL utility functions that we've developed to scrape HTML.

Let's get to it

The following functions help scrape HTML elements into useful strings. They use the following namespace.

xmlns:html="http://www.w3.org/1999/xhtml"

We prefer to just produce a CSV from the HTML, instead of trying to model it in RDF directly. There are much nicer mechanisms in csv2rdf4lod to handle URI creation within the SDV paradigm. We write a row of CSV using the following.

   <xsl:value-of select="concat($DQ,string-join((
                                                 $perigee,$apogee,$inclination,$period,$semi-major-axis
                                                ),
                                                concat($DQ,',',$DQ)),$DQ,$NL)"/>
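
The row template above relies on `$DQ` and `$NL`, which this page does not define. A minimal sketch of how they might be declared, assuming they hold a double quote and a newline and that the stylesheet emits plain text (these values are assumptions, not copied from the scrapers listed on this page):

```xslt
<xsl:stylesheet version="2.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:html="http://www.w3.org/1999/xhtml">

   <!-- Emit raw text so that quotes and commas are not escaped as XML. -->
   <xsl:output method="text"/>

   <!-- Assumed definitions: the page uses these names without showing them. -->
   <xsl:variable name="DQ" select="'&quot;'"/>   <!-- a double quote -->
   <xsl:variable name="NL" select="'&#10;'"/>    <!-- a line feed    -->

</xsl:stylesheet>
```

With these in place, `string-join` puts `","` between the values and the outer `concat` adds the opening quote, the closing quote, and the newline. A value that itself contains a double quote would probably need it doubled first, for example with `replace($value, $DQ, concat($DQ,$DQ))`.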

Example inputs

DARPA

http://www.darpa.mil/OpenCatalog/index.html circa Feb 2014

<tr>
  <td>Aptima Inc.</td>
  <td>
     <a href='http://www.darpa.mil/External_Link.aspx?url=https://github.com/Aptima/pattern-matching'>Network
Query by Example</a>
  </td>
  <td>Analytics</td>
  <td>2014-07</td>
  <td>https://github.com/Aptima/pattern-matching.git</td>
  <td>
     <a href='stats/pattern-matching/index.html'>stats</a>
  </td>
  <td>Hadoop MapReduce-over-Hive based implementation of network
query by example utilizing attributed network pattern
matching.</td>
  <td>ALv2</td>
</tr>

Visual Analytics Benchmark Repository

http://hcil2.cs.umd.edu/newvarepository/benchmarks.php

html:text

Definition:

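<!-- Returns all text beneath $node as one whitespace-normalized string.
     Each text node is normalized on its own and the pieces are joined with no separator. -->
<xsl:function name="html:text">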
   <xsl:param name="node"/>
   <xsl:variable name="together">
      <xsl:for-each select="$node//text()">
         <xsl:value-of select="normalize-space(.)"/>
      </xsl:for-each>
   </xsl:variable>
   <xsl:value-of select="normalize-space($together)"/>
</xsl:function>

Usage:

<xsl:template match="html:tr">
   <xsl:value-of select="concat(html:text(html:td[1]),$NL)"/>
</xsl:template>
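
As a worked illustration, the sketch below applies `html:text` to the first three cells of the DARPA row shown under "Example inputs". It assumes the page has been parsed as XHTML (so the elements are in the `html` namespace) and that `$DQ` and `$NL` hold a double quote and a newline; neither assumption is confirmed by this page.

```xslt
<!-- Sketch only: one CSV row from three cells of the DARPA <tr>. -->
<xsl:template match="html:tr">
   <!-- td[1] is plain text; td[2] wraps an anchor whose label spans two lines. -->
   <xsl:value-of select="concat($DQ,string-join((
                                                 html:text(html:td[1]),
                                                 html:text(html:td[2]),
                                                 html:text(html:td[3])
                                                ),
                                                concat($DQ,',',$DQ)),$DQ,$NL)"/>
   <!-- Expected row: "Aptima Inc.","Network Query by Example","Analytics" -->
</xsl:template>
```

Because each text node is normalized separately and the pieces are concatenated without a separator, inline markup such as `foo <b>bar</b>` comes out as `foobar`. If that matters for a given page, `normalize-space(string-join($node//text(), ''))` is one possible alternative that keeps the original spacing between nodes.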

Uses:

  • Feb 12 18:14 2014 darpa-mil/open-catalog/src/pubs.xsl (same as shown)
  • Feb 12 18:14 2014 darpa-mil/open-catalog/src/software.xsl (same as shown)
  • Dec 5 09:26 2013 n2yo-com/satellites/src/html2csv.xsl (shown above)
  • Dec 4 13:12 2013 n2yo-com/satellite-categories/src/category2csv.xsl (same as shown)
  • Dec 3 16:45 2013 n2yo-com/satellite-categories/src/index2csv.xsl (same as shown)
  • Dec 1 19:06 2013 n2yo-com/browse/src/html2csv.xsl (same as shown)

html:anchor-labels

Definition:

<!-- Returns the whitespace-normalized label of each anchor in $anchors, joined by '||'. -->
<xsl:function name="html:anchor-labels">
   <xsl:param name="anchors"/>

   <xsl:variable name="together">
      <xsl:for-each select="$anchors">
         <xsl:if test="position() gt 1">
            <xsl:value-of select="'||'"/>
         </xsl:if>
         <xsl:value-of select="normalize-space(.)"/>
      </xsl:for-each>
   </xsl:variable>

   <xsl:value-of select="normalize-space($together)"/>
</xsl:function>
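
No usage is shown for this function, so here is a minimal sketch of how it might be called, again assuming the DARPA row from "Example inputs" parsed into the `html` namespace. The template itself is illustrative and is not taken from any of the listed stylesheets.

```xslt
<!-- Sketch only: emit the labels of all anchors in the second cell. -->
<xsl:template match="html:tr">
   <xsl:value-of select="html:anchor-labels(html:td[2]/html:a)"/>
   <!-- For the DARPA row this should give: Network Query by Example -->
</xsl:template>
```

A cell holding several anchors would yield their labels separated by `||`, which keeps a multi-valued cell inside a single CSV column.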

Uses:

html:anchor-hrefs

Definition:

<!-- Returns the @href of each anchor in $anchors, each prefixed with $base, joined by '||'. -->
<xsl:function name="html:anchor-hrefs">
   <xsl:param name="anchors"/>
   <xsl:param name="base"/>

   <xsl:variable name="together">
      <xsl:for-each select="$anchors">
         <xsl:if test="position() gt 1">
            <xsl:value-of select="'||'"/>
         </xsl:if>
         <xsl:value-of select="concat($base,normalize-space(@href))"/>
      </xsl:for-each>
   </xsl:variable>

   <xsl:value-of select="normalize-space($together)"/>
</xsl:function>
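
A usage sketch for the DARPA row, where the "stats" link is relative and the project link is absolute. The base URL passed here is an assumption derived from the catalog address given under "Example inputs"; the template is illustrative only.

```xslt
<!-- Sketch only: resolve the anchors in two cells of the DARPA <tr>. -->
<xsl:template match="html:tr">
   <!-- The stats link is relative, so the catalog directory is passed as $base. -->
   <xsl:value-of select="html:anchor-hrefs(html:td[6]/html:a,
                                           'http://www.darpa.mil/OpenCatalog/')"/>
   <!-- Expected: http://www.darpa.mil/OpenCatalog/stats/pattern-matching/index.html -->
   <xsl:text>&#10;</xsl:text>

   <!-- The project link is already absolute, so an empty $base leaves it unchanged. -->
   <xsl:value-of select="html:anchor-hrefs(html:td[2]/html:a, '')"/>
</xsl:template>
```

Since the function prepends `$base` to every href, a single call over a mix of relative and absolute links would corrupt the absolute ones. The standard XPath 2.0 function `resolve-uri(@href, $base)` is one alternative that handles both cases, though that is not what the function above does.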

Uses:

html:parse-value

Uses:

  • n2yo-com/satellites/src/html2csv.xsl
