Skip to content

Navigating around a grid of cells like XPath for spreadsheets; supports Python 3.5+

License

Notifications You must be signed in to change notification settings

sensiblecodeio/xypath

Repository files navigation

xypath

Select different parts of a spreadsheet via functions.

----------------------------------------------------------------------
| animal  | habitat    | max age | population 2012 | Population 2013 |
| zebra   | grasslands | 40      | 20M             | 21M             |
| baboon  | jungle     | 45      | 35M             | 40M             |
| narwhal | ocean      | 115     | 1M              | 500,000         |
----------------------------------------------------------------------
| AnimalSheet |
                                               * not strictly accurate

Example

Question: What is the change in population for each type of animal between 2012 and 2013?

First, we need to load the spreadsheet. Supposing we've got it on disk already.

import xypath
table = xypath.Table.from_filename("animals.xls", table_name='AnimalSheet')

Next we need to find the row and column headers. In this case the row headers are "zebra", "baboon" and "narwhal" and the column headers are "population 2012" and "Population 2013" (note the change in case!)

One way of getting the row headers would be to find "animal" and fill down to get the actual animals:

animals = table.filter('animal').assert_one().fill(xypath.DOWN)

Let's break that down. First we use the table filter method which finds one or more cells matching the given argument. If you simply provide a string like animal it will search for exact matches. filter returns a Bag which is a cell container with some useful properties.

assert_one blows up if the Bag doesn't contain exactly one cell.

We then do fill(xypath.DOWN) which gets all the cells below and excluding the "animal" cell.

Next we need to find the column containing the year. For this we will use a regular expression.

years = table.filter(re.compile('[Pp]opulation \d{4}'))

The filter function recognises that it's been passed a regular expression and uses it to find matching cells in the table.

Now we've got two bags:

  1. animal contains "zebra", "baboon" and "narwhal"
  2. years contains "population 2012" and "Population 2013"

Next we want to find the cells which line up with each of those animal and year cells.

for (animal, year, population) = animals.junction(years):
    print("{0} : {1} : {2}".format(
        animal.value, year.value, population.value))

This would print:

zebra : population in 2012 : 20M
baboon : population in 2012 : 35M
narwhal : population in 2012 : 1M
zebra : Population in 2013 : 21M
baboon : Population in 2013 : 40M
narwhal : Population in 2013 : 500,000

You can junction two Bags of header cells together the intersection between each pair of cells. Junction always yields a three-long tuple, in this case containing (animal, year, population).

It's left up to the reader to extract the year from the population string and convert the millions to reasonable numbers!

Loading tables

table = xypath.Table.from_filename("mysheet.xls", table_name="Sheet 1")

By using a file-object:

with open('spreadsheet.xls') as f:
    table = xypath.Table.from_file_object(f, table_index=0)

Note that you need to specify which table in the spreadsheet table_name (relating to ie Excel sheet name) or table_index (where 0 is the first sheet found)

From a URL (via a StringIO file-like-object):

import requests
from cStringIO import StringIO

response = requests.get("http://example.io/mysheet.xls")
f = StringIO(response.content)
table = xypath.Table.from_file_object(f, table_name='Sheet1')

Or using the underlying messytables library directly:

xypath_tables = []
with open('spreadsheet.xls', 'rb') as f:
    for messy_table in messytables.excel.XLSTableSet(f).tables:
        xypath_table = xypath.Table.from_messy(messy_table)

Filtering

You can use the filter(..) method to get Bags of cells which match certain criteria.

bag.filter("kitten")                     # cell.value is exactly 'kitten'
table.filter(re.compile(".a.*e"))        # regular expression of cell.value
table.filter(hamcrest.ends_with("c"))    # any pyhamcrest matcher on cell.value

All the above matchers act exclusively on the value of the cell.

It's also possible to pass any callable to filter. In this case, filter passes the cell to the callable so it's possible to access properties such as cell.x and cell.value:

def is_header_cell(cell):
    return cell.x == 0 and cell.value.lower().startswith('population ')

table.filter(is_header_cell)

table.filter(lambda cell: cell.x == 2)   # explicit lambda function on each cell

Other Selection Methods

Select different cells in the table based on those currently in the bag

def has_same_text(table_cell, bag_cell):
    return table_cell.value == bag_cell.value

bag.select(has_same_text)  # cells with same value
bag.shift(x=-2)            # cells two to the left of the current cells
bag_b = bag_a.fill(xypath.LEFT)      # cells to the LEFT (RIGHT, UP, DOWN, UP_RIGHT ...) excluding bag_a

Method Chaining

Chain methods together; everything returns a bag!

dollars = table.filter("Amount").assert_one().fill(xypath.DOWN).filter(re.search("$"))

Singleton Bags

When a Bag contains only one cell (a singleton Bag), you can call bag.value and get the value of the single cell inside the bag. This is also true for cell.x, cell.y etc.

Get the value (of a bag containing only one cell) or

lonely_cell.value

Get cells which are at the intersection of two other cells:

triplets = row_header_bag.junction(column_header_bag)

Running the tests

Set up a virtual environment, and install requirements.txt:

virtualenv venv
. venv/bin/activate
pip install -r requirements.txt

Then run the tests using nosetests:

nosetests # runs all tests
nosetests test/test_bag.py # runs a single test

About

Navigating around a grid of cells like XPath for spreadsheets; supports Python 3.5+

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages