
Shepherd2

GOALS

  1. Shepherd 2.0 should not copy a "substantial" amount of any commercial EPG into its final output by default.

  2. In terms of data quality and accuracy, Shepherd 2.0 should be at least the equal of Shepherd 1.x.

PROCESS

  1. Input
  2. Predict
  3. Validate
  4. Augment
  5. Postprocess
  6. Cache
  7. Output

STRATEGY

(1) Shepherd begins with a non-commercial base EPG (rather than building a new EPG completely from scratch).

(1.1) This EPG is ideally Shepherd's output from yesterday.

(1.2) Failing that, the user is asked to provide input XMLTV (e.g. from OzTivo).

(2) Shepherd uses this initial EPG to fill in future days via prediction.

(2.1) Shepherd generates a 14-day EPG by default.

(2.2) The Predictor is a self-contained module, as is the Reconciler. This allows future enhancement of the Predictor without complicating Shepherd.

(2.3) As a basic first step, Shepherd simply adds 7 days to the timestamps of old data until it has filled out its desired time window.
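
As a rough illustration of this basic prediction step, here is a minimal Python sketch. The programme structure, XMLTV timestamp handling, and function names are assumptions for illustration, not Shepherd's actual code.

```python
# Illustrative sketch only. Assumes programmes are dicts with XMLTV-style
# "start"/"stop" strings such as "20180320183000 +1100", sorted by time.
from datetime import datetime, timedelta

XMLTV_FMT = "%Y%m%d%H%M%S %z"

def shift_week(programme):
    """Return a copy of a programme with start/stop pushed forward 7 days."""
    shifted = dict(programme)
    for key in ("start", "stop"):
        t = datetime.strptime(programme[key], XMLTV_FMT)
        shifted[key] = (t + timedelta(days=7)).strftime(XMLTV_FMT)
    return shifted

def predict(base_epg, window_end):
    """Repeat the base EPG in 7-day steps until the window is filled.
    `window_end` must be a timezone-aware datetime."""
    predicted = list(base_epg)
    frontier = list(base_epg)
    while frontier and datetime.strptime(frontier[-1]["stop"], XMLTV_FMT) < window_end:
        frontier = [shift_week(p) for p in frontier]
        predicted.extend(frontier)
    return predicted
```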

(2.4) (Future:) A more advanced Predictor removes or replaces one-off shows such as movies.

(2.5) (Future:) A more advanced Predictor uses thetvdb.com and similar sources to look up sub-titles for future episodes of series.

(2.6) The components imdb_augment_data and tvdb_augment_data are no longer considered "postprocessors" but rather "augmentors." Augmentors are like grabbers, but retrieve data on a per-show basis, rather than on a schedule-dependent time window.
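
To make the grabber/augmentor distinction concrete, a hypothetical interface sketch follows. The real components (e.g. imdb_augment_data, tvdb_augment_data) are Shepherd Perl modules with their own conventions; these names and signatures are assumptions.

```python
# Hypothetical interfaces: a grabber fetches a schedule window, an augmentor
# fetches extra detail for one show at a time.
from typing import Protocol

class Grabber(Protocol):
    def grab(self, channel: str, start: str, stop: str) -> list[dict]:
        """Return all programmes for a channel within a time window."""

class Augmentor(Protocol):
    def augment(self, programme: dict) -> dict:
        """Return extra fields (description, year, ...) for a single show."""
```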

(3) Shepherd validates against online sources.

(3.1) The process of "validation" is logically separate from "augmentation." Validation is the process of checking only five fields: title, channel, start time, stop time, and sub-title. Augmentation is the process of adding or enhancing data in other fields (e.g. description, year, star-rating). Shepherd completes Validation before starting Augmentation. That is, given a limited capacity to draw on commercial sources, Shepherd will take the time, title, and other validation fields before the date, description, or other fields.
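
To make the split concrete, a small sketch of the two field groups; the field names follow XMLTV conventions and are assumptions, not Shepherd's internal representation.

```python
# Illustrative only: the validation/augmentation field split described above.
VALIDATION_FIELDS = {"title", "channel", "start", "stop", "sub-title"}

def split_fields(programme):
    """Partition a programme dict into the fields checked during Validation
    and the fields only touched during Augmentation."""
    validate = {k: v for k, v in programme.items() if k in VALIDATION_FIELDS}
    augment = {k: v for k, v in programme.items() if k not in VALIDATION_FIELDS}
    return validate, augment
```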

(3.2) Where Shepherd finds conflicts between its EPG and an online source, it intelligently resolves such conflicts and updates its own EPG. This process may involve copying fields from an online EPG into its output.

There are three functions performed here:

  1. Figuring out which grabbers can supply our wanted data, and intelligently selecting the best/most efficient one.
  2. Invoking the grabber and testing, parsing, and analysing its output.
  3. Comparing the output against Shepherd's internal EPG and merging as necessary.

Currently, 1 and 2 are performed by Shepherd internally and 3 by the Reconciler component, and there is probably no reason to change this. The main difference is that rather than running the Reconciler once only, it would be invoked many times, following each new input.
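
A minimal sketch of how these three functions could fit together, with the grabber-selection, grabber-invocation, and reconcile steps passed in as callables. All names here are illustrative, not Shepherd's real interfaces.

```python
def validate(epg, wanted, select_best_grabber, run_grabber, reconcile):
    """Keep fetching from the best remaining source and reconciling the
    result into the EPG -- the reconciler runs once per new input."""
    while wanted:
        grabber, request = select_best_grabber(wanted)        # function 1
        if grabber is None:
            break                       # nothing left worth (or affordable) fetching
        new_data = run_grabber(grabber, request)               # function 2
        epg, wanted = reconcile(epg, new_data, wanted)         # function 3
    return epg
```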

(3.3) Shepherd understands which online sources are commercial/protected (e.g. YourTV) and which may be drawn from freely (e.g. oztivo).

(3.4) Each field (e.g. title, year, desc) has a hard-coded "substantiality" value. Shepherd tracks the value of the fields it copies and by default will stop copying from a commercial source once the value of copied fields exceeds a hard-coded limit, which signifies "substantial" copying. It thereafter considers that source unavailable for more copying that day.
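
A minimal sketch of how such a per-source budget could be tracked; the field values and daily limit below are illustrative placeholders, not Shepherd's actual hard-coded numbers.

```python
# Placeholder "substantiality" values per field and a per-source daily limit.
FIELD_VALUE = {"title": 5, "sub-title": 3, "desc": 4, "year": 1,
               "channel": 1, "start": 2, "stop": 2}
DAILY_LIMIT = 100

class SourceBudget:
    """Tracks the cumulative value of fields copied from one commercial source."""
    def __init__(self, limit=DAILY_LIMIT):
        self.limit = limit
        self.copied = 0

    def can_copy(self, field):
        return self.copied + FIELD_VALUE.get(field, 1) <= self.limit

    def record_copy(self, field):
        self.copied += FIELD_VALUE.get(field, 1)

    @property
    def exhausted(self):
        # Once the limit is hit, the source is unavailable for the rest of the day.
        return self.copied >= self.limit
```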

(3.5) Shepherd continues to gather online data for validation until it has reached an optimal state. An optimal state intelligently balances the need to validate guide data against the quality of the remaining available online data and the bandwidth cost of fetching it.

(3.6) Shepherd considers it more urgent to validate shows if they are scheduled in the near future, in prime-time, or the user has a MythTV recording rule for them.
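
For illustration, a hypothetical urgency score combining these three factors; the weights and function are assumptions, not Shepherd's real logic.

```python
from datetime import datetime, timezone

def validation_urgency(start, prime_time, has_recording_rule, now=None):
    """Higher scores mean the show should be validated sooner.
    `start` is an aware datetime; the other two arguments are booleans."""
    now = now or datetime.now(timezone.utc)
    hours_away = max((start - now).total_seconds() / 3600.0, 0.0)
    score = 100.0 / (1.0 + hours_away)   # shows in the near future score higher
    if prime_time:
        score += 20.0
    if has_recording_rule:
        score += 50.0                    # MythTV recording rules weigh heavily
    return score
```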

(4) Shepherd augments its EPG with online sources.

(4.1) Augmentation is the same process as validation, driven by the same logic engine, except:

(a) Shepherd wants data on a per-show basis, not a time window, so may employ augmentors (not just grabbers).

(b) Shepherd looks at data in fields other than the essential validation fields of (3.1): title, channel, start/stop time, and sub-title.

(c) When determining whether it is worth downloading more online data, the need to augment a show is considered less urgent than the need to validate it.

(5) Shepherd employs postprocessors to flag HDTV, correct timezone data, etc, in the same manner as 1.x.

(6) Shepherd maintains an internal cache.

(6.1) The output from grabbers/augmentors used in one stage remains available to the others. For example, if the Predictor in Stage 2 looked up details for "House", these could be used by the Augmentor (Stage 4).
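
A minimal sketch of such a shared cache, keyed by source and show title; the class name and keying scheme are assumptions for illustration.

```python
class ShowCache:
    """Per-run cache of grabber/augmentor results, shared across stages."""
    def __init__(self):
        self._store = {}

    def get(self, source, title):
        return self._store.get((source, title))

    def put(self, source, title, data):
        self._store[(source, title)] = data

# e.g. the Predictor (stage 2) does cache.put("thetvdb", "House", episodes), and
# the Augmentor (stage 4) later reads it back with cache.get("thetvdb", "House").
```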

(6.2) (Future:) Output from grabbers/augmentors remains in Shepherd's internal cache for use in future days, and Shepherd makes intelligent decisions about how to efficiently use it. (For now, however, this is probably too ambitious: it's non-essential and may not be worthwhile.)

(7) Shepherd explicitly caches its output for use in tomorrow's run.
