Scraping HTML
Tim L edited this page May 27, 2014 · 46 revisions
::sigh::
First, a nice article about just using the web as an API.
Others' work:
- https://scraperwiki.com
- Python-based parser: BeautifulSoup
- http://web-xslt.googlecode.com/svn/trunk/htmlparse/htmlparse.xsl parses an HTML string into a DOM object.
  - Its functions are in the namespace xmlns:d="data:,dpc".
This page lists some XSL utility functions that we've developed to scrape HTML elements into useful strings. They use the following namespace:
xmlns:html="http://www.w3.org/1999/xhtml"
We prefer to just produce a CSV from the HTML, instead of trying to model it in RDF directly. There are much nicer mechanisms in csv2rdf4lod to handle URI creation within the SDV paradigm. We write a row of CSV using the following.
<xsl:value-of select="concat($DQ,string-join((
                         $perigee,$apogee,$inclination,$period,$semi-major-axis
                      ),
                      concat($DQ,',',$DQ)),$DQ,$NL)"/>
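The expression above relies on two variables, $DQ and $NL, that this page never defines. Below is a minimal sketch of a complete stylesheet, assuming they hold a double quote and a newline and that the input has already been tidied into well-formed XHTML. The choice of the first two cells is purely illustrative.

```xml
<xsl:stylesheet version="2.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:html="http://www.w3.org/1999/xhtml">

<xsl:output method="text"/>

<!-- Assumed definitions; the page uses $DQ and $NL without showing them. -->
<xsl:variable name="DQ" select="'&quot;'"/>
<xsl:variable name="NL" select="'&#xa;'"/>

<!-- One CSV row per table row: wrap the row in quotes and
     join the cell values with the three characters "," -->
<xsl:template match="html:tr">
   <xsl:value-of select="concat($DQ,string-join((
                            normalize-space(html:td[1]),
                            normalize-space(html:td[2])
                         ),
                         concat($DQ,',',$DQ)),$DQ,$NL)"/>
</xsl:template>

<!-- Suppress any text outside the table rows. -->
<xsl:template match="text()"/>

</xsl:stylesheet>
```

Joining on `","` and wrapping the whole row in $DQ quotes every cell without a per-cell concat. This sketch does not escape double quotes inside a value; such values would need their quotes doubled before being joined.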
http://www.darpa.mil/OpenCatalog/index.html circa Feb 2014
<tr>
   <td>Aptima Inc.</td>
   <td>
      <a href='http://www.darpa.mil/External_Link.aspx?url=https://github.com/Aptima/pattern-matching'>Network
         Query by Example</a>
   </td>
   <td>Analytics</td>
   <td>2014-07</td>
   <td>https://github.com/Aptima/pattern-matching.git</td>
   <td>
      <a href='stats/pattern-matching/index.html'>stats</a>
   </td>
   <td>Hadoop MapReduce-over-Hive based implementation of network
       query by example utilizing attributed network pattern
       matching.</td>
   <td>ALv2</td>
</tr>
http://hcil2.cs.umd.edu/newvarepository/benchmarks.php
Definition:
<xsl:function name="html:text">
   <xsl:param name="node"/>
   <xsl:variable name="together">
      <xsl:for-each select="$node//text()">
         <xsl:value-of select="normalize-space(.)"/>
      </xsl:for-each>
   </xsl:variable>
   <xsl:value-of select="normalize-space($together)"/>
</xsl:function>
Usage:
<xsl:template match="html:tr">
   <xsl:value-of select="concat(html:text(html:td[1]),$NL)"/>
</xsl:template>
Uses:
- Feb 12 18:14 2014 darpa-mil/open-catalog/src/pubs.xsl (same as shown)
- Feb 12 18:14 2014 darpa-mil/open-catalog/src/software.xsl (same as shown)
- Dec 5 09:26 2013 n2yo-com/satellites/src/html2csv.xsl (shown above)
- Dec 4 13:12 2013 n2yo-com/satellite-categories/src/category2csv.xsl (same as shown)
- Dec 3 16:45 2013 n2yo-com/satellite-categories/src/index2csv.xsl (same as shown)
- Dec 1 19:06 2013 n2yo-com/browse/src/html2csv.xsl (same as shown)
Definition:
<xsl:function name="html:anchor-labels">
   <xsl:param name="anchors"/>
   <xsl:variable name="together">
      <xsl:for-each select="$anchors">
         <xsl:if test="position() gt 1">
            <xsl:value-of select="'||'"/>
         </xsl:if>
         <xsl:value-of select="normalize-space(.)"/>
      </xsl:for-each>
   </xsl:variable>
   <xsl:value-of select="normalize-space($together)"/>
</xsl:function>
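Usage (a sketch, not taken from the stylesheets listed under Uses): the stylesheet below repeats the definition so it can run on its own. The $NL variable and the choice of the second cell are assumptions made for illustration, and the input is assumed to be well-formed XHTML.

```xml
<xsl:stylesheet version="2.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:html="http://www.w3.org/1999/xhtml">

<xsl:output method="text"/>

<!-- Assumed definition; $NL is used on this page but not shown. -->
<xsl:variable name="NL" select="'&#xa;'"/>

<!-- Copied from the definition above. -->
<xsl:function name="html:anchor-labels">
   <xsl:param name="anchors"/>
   <xsl:variable name="together">
      <xsl:for-each select="$anchors">
         <xsl:if test="position() gt 1">
            <xsl:value-of select="'||'"/>
         </xsl:if>
         <xsl:value-of select="normalize-space(.)"/>
      </xsl:for-each>
   </xsl:variable>
   <xsl:value-of select="normalize-space($together)"/>
</xsl:function>

<!-- Illustrative: one output line per row, holding the labels of
     every link found in the row's second cell. -->
<xsl:template match="html:tr">
   <xsl:value-of select="concat(html:anchor-labels(html:td[2]/html:a),$NL)"/>
</xsl:template>

<!-- Suppress any text outside the table rows. -->
<xsl:template match="text()"/>

</xsl:stylesheet>
```

For a row like the DARPA Open Catalog example, whose second cell holds a single link, this yields the label with its internal line break collapsed to one space. A cell with several links yields their labels separated by `||`.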
Uses:
- Feb 12 18:14 2014 darpa-mil/open-catalog/src/software.xsl (same as shown)
- Dec 5 09:26 2013 n2yo-com/satellites/src/html2csv.xsl (shown above)
Definition:
<xsl:function name="html:anchor-hrefs">
   <xsl:param name="anchors"/>
   <xsl:param name="base"/>
   <xsl:variable name="together">
      <xsl:for-each select="$anchors">
         <xsl:if test="position() gt 1">
            <xsl:value-of select="'||'"/>
         </xsl:if>
         <xsl:value-of select="concat($base,normalize-space(@href))"/>
      </xsl:for-each>
   </xsl:variable>
   <xsl:value-of select="normalize-space($together)"/>
</xsl:function>
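Usage (a sketch, not taken from the stylesheets listed under Uses): the stylesheet below repeats the definition so it can run on its own. The $NL and $catalog-base variables, the base URL value, and the choice of the sixth cell are all assumptions made for illustration, and the input is assumed to be well-formed XHTML.

```xml
<xsl:stylesheet version="2.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:html="http://www.w3.org/1999/xhtml">

<xsl:output method="text"/>

<!-- Assumed definitions; neither variable appears in the original page. -->
<xsl:variable name="NL" select="'&#xa;'"/>
<xsl:variable name="catalog-base" select="'http://www.darpa.mil/OpenCatalog/'"/>

<!-- Copied from the definition above. -->
<xsl:function name="html:anchor-hrefs">
   <xsl:param name="anchors"/>
   <xsl:param name="base"/>
   <xsl:variable name="together">
      <xsl:for-each select="$anchors">
         <xsl:if test="position() gt 1">
            <xsl:value-of select="'||'"/>
         </xsl:if>
         <xsl:value-of select="concat($base,normalize-space(@href))"/>
      </xsl:for-each>
   </xsl:variable>
   <xsl:value-of select="normalize-space($together)"/>
</xsl:function>

<!-- Illustrative: one output line per row, holding the link targets
     found in the row's sixth cell, each prefixed with the base. -->
<xsl:template match="html:tr">
   <xsl:value-of select="concat(html:anchor-hrefs(html:td[6]/html:a,$catalog-base),$NL)"/>
</xsl:template>

<!-- Suppress any text outside the table rows. -->
<xsl:template match="text()"/>

</xsl:stylesheet>
```

The function prepends $base by plain string concatenation, so it suits cells whose links are all relative, such as the `stats/pattern-matching/index.html` link in the DARPA example. For cells whose links are already absolute, passing an empty string as the base leaves them unchanged. Where a cell mixes the two, the standard XPath 2.0 function `resolve-uri()` might be a better fit, though that is not what the original stylesheets do.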
Uses:
- Feb 12 18:14 2014 darpa-mil/open-catalog/src/pubs.xsl (same as shown)
- Feb 12 18:14 2014 darpa-mil/open-catalog/src/software.xsl (same as shown)
- Dec 5 09:26 2013 n2yo-com/satellites/src/html2csv.xsl (shown above)
- Dec 4 13:12 2013 n2yo-com/satellite-categories/src/category2csv.xsl (same as shown)
- Dec 3 16:45 2013 n2yo-com/satellite-categories/src/index2csv.xsl (same as shown)
Uses:
- n2yo-com/satellites/src/html2csv.xsl