From 1adf454f77edec25fc3650ceec8a1d1263ff10c6 Mon Sep 17 00:00:00 2001
From: Mikhail Korobov
Date: Thu, 21 Jun 2018 18:10:40 +0500
Subject: [PATCH] DOC explain new .attrib property (a follow-up to https://github.com/scrapy/parsel/pull/107).

---
 docs/usage.rst | 83 ++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 77 insertions(+), 6 deletions(-)

diff --git a/docs/usage.rst b/docs/usage.rst
index 8a2d0ad5..3ec28427 100644
--- a/docs/usage.rst
+++ b/docs/usage.rst
@@ -101,6 +101,28 @@ selectors. This API can be used for quickly selecting nested data::
      'image4_thumb.jpg',
      'image5_thumb.jpg']
 
+Instead of using '@src' XPath it is possible to query for attributes using
+``.attrib`` property of a :class:`~parsel.selector.Selector`::
+
+    >>> [img.attrib['src'] for img in selector.css('img')]
+    ['image1_thumb.jpg',
+     'image2_thumb.jpg',
+     'image3_thumb.jpg',
+     'image4_thumb.jpg',
+     'image5_thumb.jpg']
+
+As a shortcut, ``.attrib`` is also available on SelectorList directly;
+it returns attributes for the first matching element::
+
+    >>> selector.css('img').attrib['src']
+    'image1_thumb.jpg'
+
+This is most useful when only a single result is expected, e.g. when selecting
+by id, or selecting unique elements on a web page::
+
+    >>> selector.css('base').attrib['href']
+    'http://example.com/'
+
 To actually extract the textual data, you must call the selector ``.extract()``
 method, as follows::
 
@@ -132,6 +154,9 @@ Now we're going to get the base URL and some image links::
     >>> selector.css('base::attr(href)').extract()
     ['http://example.com/']
 
+    >>> selector.css('base').attrib['href']
+    'http://example.com/'
+
     >>> selector.xpath('//a[contains(@href, "image")]/@href').extract()
     ['image1.html',
      'image2.html',
@@ -215,6 +240,9 @@ Examples:
     make much sense: text nodes do not have attributes, and attribute values
     are string values already and do not have children nodes.
 
+.. note::
+    See also: :ref:`selecting-attributes`.
+
 .. _CSS Selectors: https://www.w3.org/TR/css3-selectors/#selectors
 
 
@@ -237,13 +265,56 @@ too. Here's an example::
 
     >>> for index, link in enumerate(links):
     ...     args = (index, link.xpath('@href').extract(), link.xpath('img/@src').extract())
-    ...     print 'Link number %d points to url %s and image %s' % args
+    ...     print('Link number %d points to url %s and image %s' % args)
+
+    Link number 0 points to url ['image1.html'] and image ['image1_thumb.jpg']
+    Link number 1 points to url ['image2.html'] and image ['image2_thumb.jpg']
+    Link number 2 points to url ['image3.html'] and image ['image3_thumb.jpg']
+    Link number 3 points to url ['image4.html'] and image ['image4_thumb.jpg']
+    Link number 4 points to url ['image5.html'] and image ['image5_thumb.jpg']
+
+.. _selecting-attributes:
+
+Selecting element attributes
+----------------------------
+
+There are several ways to get a value of an attribute. First, one can use
+XPath syntax::
+
+    >>> selector.xpath("//a/@href").extract()
+    ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
+
+XPath syntax has a few advantages: it is a standard XPath feature, and
+``@attributes`` can be used in other parts of an XPath expression - e.g.
+it is possible to filter by attribute value.
+
+parsel also provides an extension to CSS selectors (``::attr(...)``)
+which allows to get attribute values::
+
+    >>> selector.css('a::attr(href)').extract()
+    ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
+
+In addition to that, there is a ``.attrib`` property of Selector.
+You can use it if you prefer to lookup attributes in Python
+code, without using XPath of CSS extension::
+
+    >>> [a.attrib['href'] for a in selector.css('a')]
+    ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
+
+This property is also available on SelectorList; it returns a dictionary
+with attributes of a first matching element. It is convenient to use when
+a selector is expected to give a single result (e.g. when selecting by element
+ID, or when selecting an unique element on a page)::
+
+    >>> selector.css('base').attrib
+    {'href': 'http://example.com/'}
+    >>> selector.css('base').attrib['href']
+    'http://example.com/'
+
+``.attrib`` property of an empty SelectorList is empty::
 
-    Link number 0 points to url [u'image1.html'] and image [u'image1_thumb.jpg']
-    Link number 1 points to url [u'image2.html'] and image [u'image2_thumb.jpg']
-    Link number 2 points to url [u'image3.html'] and image [u'image3_thumb.jpg']
-    Link number 3 points to url [u'image4.html'] and image [u'image4_thumb.jpg']
-    Link number 4 points to url [u'image5.html'] and image [u'image5_thumb.jpg']
+    >>> selector.css('foo').attrib
+    {}
 
 Using selectors with regular expressions
 ----------------------------------------