Scraping HTML
Tim L edited this page May 27, 2014 · 46 revisions
::sigh::
First, a nice article about just using the web as an API.
Others' work:
- https://scraperwiki.com
- Python-based parser: BeautifulSoup
- http://web-xslt.googlecode.com/svn/trunk/htmlparse/htmlparse.xsl parses an HTML string into a DOM object.
  - Its functions are in xmlns:d="data:,dpc".
This page lists some XSL utility functions that we've developed to scrape HTML.
The following functions help scrape HTML elements into useful strings. They use the following namespace.
xmlns:html="http://www.w3.org/1999/xhtml"
We prefer to just produce a CSV from the HTML, instead of trying to model it in RDF directly. There are much nicer mechanisms in csv2rdf4lod to handle URI creation within the SDV paradigm. We write a row of CSV using the following.
<xsl:value-of select="concat($DQ,string-join((
                         $perigee,$apogee,$inclination,$period,$semi-major-axis
                      ),
                      concat($DQ,',',$DQ)),$DQ,$NL)"/>
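To make the quoting concrete, here is a rough Python sketch of the same row-building expression. It assumes that $DQ holds a double-quote character and $NL a newline (the names suggest this, but their definitions are not shown on this page), and the sample values are made up. The expression does not escape double quotes inside a value; csv_row_escaped is a hypothetical variant that doubles them, in case the scraped text can contain quotes.

```python
DQ = '"'   # assumed value of $DQ
NL = "\n"  # assumed value of $NL


def csv_row(values):
    # Same shape as concat($DQ, string-join((...), concat($DQ, ',', $DQ)), $DQ, $NL):
    # open a quote, join the fields with quote-comma-quote, close the quote, end the line.
    return DQ + (DQ + "," + DQ).join(values) + DQ + NL


def csv_row_escaped(values):
    # Hypothetical variant (not in the original stylesheet): double any embedded
    # double quote so the row stays valid CSV.
    return csv_row(v.replace(DQ, DQ + DQ) for v in values)


if __name__ == "__main__":
    # Made-up sample values standing in for $perigee, $apogee, $inclination.
    print(csv_row(["354", "361", "51.6"]), end="")  # prints "354","361","51.6"
```

Quoting every field keeps commas inside a cell from splitting the column.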
http://www.darpa.mil/OpenCatalog/index.html circa Feb 2014
<tr>
<td>Aptima Inc.</td>
<td>
<a href='http://www.darpa.mil/External_Link.aspx?url=https://github.com/Aptima/pattern-matching'>Network
Query by Example</a>
</td>
<td>Analytics</td>
<td>2014-07</td>
<td>https://github.com/Aptima/pattern-matching.git</td>
<td>
<a href='stats/pattern-matching/index.html'>stats</a>
</td>
<td>Hadoop MapReduce-over-Hive based implementation of network
query by example utilizing attributed network pattern
matching.</td>
<td>ALv2</td>
</tr>
http://hcil2.cs.umd.edu/newvarepository/benchmarks.php
Definition:
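<!-- Returns the text under $node as one whitespace-normalized string.
     Each text node is normalized on its own; the pieces are joined without a separator. -->
<xsl:function name="html:text">
   <xsl:param name="node"/>
   <xsl:variable name="together">
      <xsl:for-each select="$node//text()">
         <xsl:value-of select="normalize-space(.)"/>
      </xsl:for-each>
   </xsl:variable>
   <xsl:value-of select="normalize-space($together)"/>
</xsl:function>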
Usage:
<xsl:template match="html:tr">
   <xsl:value-of select="concat(html:text(html:td[1]),$NL)"/>
</xsl:template>
Uses:
- Feb 12 18:14 2014 darpa-mil/open-catalog/src/pubs.xsl (same as shown)
- Feb 12 18:14 2014 darpa-mil/open-catalog/src/software.xsl (same as shown)
- Dec 5 09:26 2013 n2yo-com/satellites/src/html2csv.xsl (shown above)
- Dec 4 13:12 2013 n2yo-com/satellite-categories/src/category2csv.xsl (same as shown)
- Dec 3 16:45 2013 n2yo-com/satellite-categories/src/index2csv.xsl (same as shown)
- Dec 1 19:06 2013 n2yo-com/browse/src/html2csv.xsl (same as shown)
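For readers who want to check what html:text returns without running an XSLT 2.0 processor, the following Python sketch mirrors its logic with the standard library. It is an approximation under stated assumptions: the function names html_text and normalize_space are mine, the sample row is a trimmed copy of the DARPA row above, and the third cell is made up to show that text nodes are joined with no separator between them.

```python
import re
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"


def normalize_space(s):
    # Like XPath normalize-space(): collapse runs of space, tab, CR and LF
    # into one space and trim both ends.
    return re.sub(r"[ \t\r\n]+", " ", s).strip(" ")


def html_text(node):
    # Mirror of html:text: normalize every descendant text node on its own,
    # concatenate the pieces with no separator, then normalize the result.
    together = "".join(normalize_space(t) for t in node.itertext())
    return normalize_space(together)


# Trimmed sample row; the third cell is invented for illustration.
row = ET.fromstring(
    '<tr xmlns="http://www.w3.org/1999/xhtml">'
    "<td>Aptima Inc.</td>"
    "<td><a href='stats/pattern-matching/index.html'>Network\n   Query by Example</a></td>"
    "<td>Hadoop <b>MapReduce</b></td>"
    "</tr>"
)
cells = row.findall(f"{{{XHTML}}}td")

if __name__ == "__main__":
    for cell in cells:
        print(html_text(cell))
```

The third cell comes out as HadoopMapReduce, because the space at the end of the first text node is trimmed before the pieces are joined. That is harmless for cells holding a single text node, but worth knowing for cells with inline markup.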
Definition:
<!-- Returns the whitespace-normalized labels of $anchors, joined with '||'. -->
<xsl:function name="html:anchor-labels">
   <xsl:param name="anchors"/>
   <xsl:variable name="together">
      <xsl:for-each select="$anchors">
         <xsl:if test="position() gt 1">
            <xsl:value-of select="'||'"/>
         </xsl:if>
         <xsl:value-of select="normalize-space(.)"/>
      </xsl:for-each>
   </xsl:variable>
   <xsl:value-of select="normalize-space($together)"/>
</xsl:function>
Uses:
- Feb 12 18:14 2014 darpa-mil/open-catalog/src/software.xsl (same as shown)
- Dec 5 09:26 2013 n2yo-com/satellites/src/html2csv.xsl (shown above)
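As a quick illustration of what html:anchor-labels produces, here is a Python sketch of the same joining logic. The function names and the sample cell (two anchors, the second one made up) are assumptions for the example, and the sample omits the XHTML namespace to stay short.

```python
import re
import xml.etree.ElementTree as ET


def normalize_space(s):
    # Like XPath normalize-space(): collapse space, tab, CR, LF runs and trim.
    return re.sub(r"[ \t\r\n]+", " ", s).strip(" ")


def anchor_labels(anchors):
    # Mirror of html:anchor-labels: take each anchor's full text content,
    # normalize it, and join the labels with '||'.
    together = "||".join(normalize_space("".join(a.itertext())) for a in anchors)
    return normalize_space(together)


# Made-up cell with two links; the first label is split across lines on purpose.
cell = ET.fromstring(
    "<td>"
    "<a href='stats/pattern-matching/index.html'>Network\n   Query by Example</a>"
    "<a href='more.html'>more <i>info</i></a>"
    "</td>"
)

if __name__ == "__main__":
    print(anchor_labels(cell.findall("a")))  # prints Network Query by Example||more info
```

The '||' delimiter lets several links share one CSV cell, so a later step can split the cell back into individual values.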
Definition:
<!-- Returns the href of each of $anchors, prefixed with $base and joined with '||'. -->
<xsl:function name="html:anchor-hrefs">
   <xsl:param name="anchors"/>
   <xsl:param name="base"/>
   <xsl:variable name="together">
      <xsl:for-each select="$anchors">
         <xsl:if test="position() gt 1">
            <xsl:value-of select="'||'"/>
         </xsl:if>
         <xsl:value-of select="concat($base,normalize-space(@href))"/>
      </xsl:for-each>
   </xsl:variable>
   <xsl:value-of select="normalize-space($together)"/>
</xsl:function>
Uses:
- Feb 12 18:14 2014 darpa-mil/open-catalog/src/pubs.xsl (same as shown)
- Feb 12 18:14 2014 darpa-mil/open-catalog/src/software.xsl (same as shown)
- Dec 5 09:26 2013 n2yo-com/satellites/src/html2csv.xsl (shown above)
- Dec 4 13:12 2013 n2yo-com/satellite-categories/src/category2csv.xsl (same as shown)
- Dec 3 16:45 2013 n2yo-com/satellite-categories/src/index2csv.xsl (same as shown)
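The sketch below mirrors html:anchor-hrefs in Python to show how $base is applied. The base URL, the sample cell, and both function names are assumptions for illustration; the stylesheets listed above may pass a different base. Because the function simply prepends $base, an href that is already absolute gets the prefix too. anchor_hrefs_resolved is a hypothetical alternative that resolves each href against the page URL with urllib.parse.urljoin instead.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urljoin


def normalize_space(s):
    # Like XPath normalize-space(): collapse space, tab, CR, LF runs and trim.
    return re.sub(r"[ \t\r\n]+", " ", s).strip(" ")


def anchor_hrefs(anchors, base):
    # Mirror of html:anchor-hrefs: prepend base to every href, join with '||'.
    together = "||".join(base + normalize_space(a.get("href", "")) for a in anchors)
    return normalize_space(together)


def anchor_hrefs_resolved(anchors, page_url):
    # Hypothetical variant (not in the original stylesheet): resolve relative
    # hrefs against the page URL and leave absolute hrefs untouched.
    return "||".join(
        urljoin(page_url, normalize_space(a.get("href", ""))) for a in anchors
    )


# Sample cell: one relative link from the row shown earlier, one absolute link
# added for illustration.
cell = ET.fromstring(
    "<td>"
    "<a href='stats/pattern-matching/index.html'>stats</a>"
    "<a href='https://github.com/Aptima/pattern-matching'>code</a>"
    "</td>"
)
anchors = cell.findall("a")

if __name__ == "__main__":
    print(anchor_hrefs(anchors, "http://www.darpa.mil/OpenCatalog/"))
    print(anchor_hrefs_resolved(anchors, "http://www.darpa.mil/OpenCatalog/index.html"))
```

Plain concatenation is enough when every link on a page is relative to the same directory; the resolving variant is only worth the extra step when relative and absolute links are mixed.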
Uses:
- n2yo-com/satellites/src/html2csv.xsl