This repository has been archived by the owner on Sep 18, 2020. It is now read-only.

worker_rest_api

Grzegorz Mrukwa edited this page Nov 27, 2017 · 1 revision

Motivation

This document aims to define a consistent API that every algorithm worker container exposes.

Analysis

Common part of the algorithms:

  • dataset (is it common?)

Disjoint parts:

  • more than one dataset may be needed
  • other parameters may be needed, such as an ROI
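
One way to reconcile the common and disjoint parts is a request body with a list of datasets plus a free-form parameter map. The sketch below is only an illustration: the names `JobRequest`, `datasets`, `parameters` and `parse_job_request` are hypothetical and do not come from the original design.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class JobRequest:
    """Hypothetical request body shared by all workers (an assumption)."""

    # One or more dataset identifiers; some algorithms need more than one.
    datasets: List[str]
    # Algorithm-specific extras (for example an ROI), kept free-form.
    parameters: Dict[str, Any] = field(default_factory=dict)

    def validate(self) -> None:
        if not self.datasets:
            raise ValueError("at least one dataset is required")


def parse_job_request(body: Dict[str, Any]) -> JobRequest:
    """Build and validate a JobRequest from a decoded JSON body."""
    request = JobRequest(
        datasets=list(body.get("datasets", [])),
        parameters=dict(body.get("parameters", {})),
    )
    request.validate()
    return request
```

Keeping `parameters` untyped lets each worker define its own extras without changing the shared envelope, at the cost of per-worker validation.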

API

Option 1 (async) - TEMPORARILY DISCARDED

  1. Requirements for the worker:
    a. Heartbeat
    b. 202 if a new job is accepted - immediate response from the worker
    c. Returns the response to the master when the job finishes
    d. 503 if scaled down & preoccupied
  2. Requirements for the master:
    a. Manage heartbeats for different types of workers independently
    b. Serve returned results (?)
    c. Renew the request on 503
  3. Benefits:
    a. Possibility to provide partial completion status
    b. Possible fault tolerance
  4. Cons:
    a. Complexity
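
The master-side requirement to manage heartbeats per worker type could look roughly like the sketch below. The class name `HeartbeatRegistry`, the 30-second timeout and the injectable clock are all assumptions made for illustration, not part of the original design.

```python
import time
from typing import Callable, Dict, List


class HeartbeatRegistry:
    """Tracks the last heartbeat per worker, grouped by worker type."""

    def __init__(self, timeout_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout = timeout_seconds  # assumed value, not from the design
        self._clock = clock
        # worker type -> worker host -> time of the last heartbeat
        self._last_seen: Dict[str, Dict[str, float]] = {}

    def beat(self, worker_type: str, host: str) -> None:
        """Record a heartbeat received from a worker."""
        self._last_seen.setdefault(worker_type, {})[host] = self._clock()

    def alive(self, worker_type: str) -> List[str]:
        """Return hosts of this type whose heartbeat is recent enough."""
        now = self._clock()
        workers = self._last_seen.get(worker_type, {})
        return sorted(host for host, seen in workers.items()
                      if now - seen <= self._timeout)
```

Grouping by worker type keeps a silent GMM worker from affecting the availability of, say, DiviK workers.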

Example request flow

  1. Post request to worker: master -> worker
    POST /worker/job -> 503 Service Unavailable
    POST /worker/job -> 503 Service Unavailable
    POST /worker/job -> 503 Service Unavailable
    POST /worker/job -> 503 Service Unavailable
    POST /worker/job -> 202 Accepted { id: job_id }
    
  2. Meanwhile, worker notifies master about its health: worker -> master (e.g. every 10sec)
    POST /master/healthcheck {
      processed_job: job_id,
      response_type: 'gmm_response'
    }
    
  3. Finally, worker responds to master with results: worker -> master (retried up to 60sec)
    Success -> POST /master
    {
      id: string = job_id,
      response_type: string = 'gmm_response',
      result: object = {
        ...?
      }
    }
    
    Failure in algorithm -> POST /master
    {
      id: string = job_id,
      response_type: string = 'error',
      stack_trace: string,
      exception: string,
      message: string
    }
    
    Failure in Web service
      -> no request to master
      -> worker removed from master's list because of healthcheck
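
The first step of this flow, where the master renews the request while the worker answers 503, can be sketched as follows. This is a minimal illustration under assumptions: `submit_job`, the attempt limit and the delay are hypothetical, and the HTTP call is passed in as a callable so the sketch does not depend on any particular client library.

```python
import time
from typing import Callable, Dict, Optional, Tuple

# (status code, decoded body) as returned by whatever HTTP client is used.
Response = Tuple[int, Dict[str, str]]


def submit_job(
    post: Callable[[], Response],
    max_attempts: int = 5,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Retry POST /worker/job while the worker answers 503.

    Returns the job id once the worker accepts the job, or None when
    every attempt was rejected with 503.
    """
    for attempt in range(max_attempts):
        status, body = post()
        if status in (200, 202):  # either success code counts as accepted
            return body["id"]
        if status != 503:
            raise RuntimeError(f"unexpected status {status}")
        if attempt < max_attempts - 1:
            sleep(delay_seconds)  # wait before renewing the request
    return None
```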
    

Option 2 (synchronous) - TEMPORARILY ACCEPTED

  1. Requirements for the worker:
    a. 200 with payload when finished - the response may arrive only after several minutes
  2. Requirements for the master:
    a. Serve returned results
  3. Benefits:
    a. Simplicity of implementation
  4. Cons:
    a. No fault tolerance

Example request flow

  1. Master posts job to worker: master -> worker
    POST /worker/job -> timeout - retry
    POST /worker/job -> timeout - retry
    POST /worker/job -> blocked by worker
    
  2. Worker responds with result: worker -> master
    Success -> 200 OK
    {
      response_type: string = 'gmm_response',
      result: object = {
        output_file: string = '\\share\data\output_file',
      }
    }
    
    Failed in algorithm -> 500 Internal Server Error
    {
      response_type: string = 'algorithm_error',
      stack_trace: string,
      exception: string,
      message: string
    }
    
    Failed in Web service -> no output
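
On the worker side, the two response shapes above could be produced by one handler that wraps the algorithm call. The sketch below is an assumption-laden illustration: `handle_job` and `run_algorithm` are hypothetical names, and the function returns a plain `(status, body)` pair instead of binding to a specific web framework.

```python
import traceback
from typing import Any, Callable, Dict, Tuple


def handle_job(run_algorithm: Callable[[], str],
               response_type: str = "gmm_response") -> Tuple[int, Dict[str, Any]]:
    """Run the algorithm and build the (status, body) pair for the master.

    run_algorithm is assumed to return the path of the output file.
    """
    try:
        output_file = run_algorithm()
    except Exception as error:
        # Failure in the algorithm: report details with a 500 status.
        return 500, {
            "response_type": "algorithm_error",
            "stack_trace": traceback.format_exc(),
            "exception": type(error).__name__,
            "message": str(error),
        }
    # Success: 200 with the location of the result.
    return 200, {
        "response_type": response_type,
        "result": {"output_file": output_file},
    }
```

A failure in the web service itself never reaches this handler, which matches the "no output" case: the master only sees a broken or timed-out connection.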
    

Architecture - ACCEPTED

Involved services:

  • 2x Web
  • 1x master
  • 1x database: MSSQL/Postgres?
  • Nx worker: GMM/DiviK/other

web traffic -> Web                 Worker
                   \             /
               ...   DB - Master   ...
                   /             \
web traffic -> Web                 Worker

Web

Web = front + API + nginx

Responsibilities:

  • accept computation requests from external world (API):
    • DiviK
    • GMM
    • ROI
  • allow retrieval of computation artifacts
  • serve the frontend (due to the Docker architecture)
  • save the computation task type, from which the worker name can be retrieved
  • serve computed resources

Master

Master = single process

Responsibilities:

  • zoo keeper = sends tasks to the proper workers
  • knows which worker (by host name) is responsible for a particular job (this information should be retrieved from the database by job type)
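
The lookup from job type to worker host might be sketched as below. Everything concrete here is hypothetical: the `WORKER_HOSTS` dictionary stands in for the database table, the host names are invented, and treating `/worker/job` as the URL path is an assumption based on the request flows.

```python
from typing import Dict

# Hypothetical stand-in for the database table mapping job type to worker host.
WORKER_HOSTS: Dict[str, str] = {
    "gmm": "gmm-worker",
    "divik": "divik-worker",
}


def worker_url(job_type: str, hosts: Dict[str, str] = WORKER_HOSTS) -> str:
    """Resolve the job endpoint of the worker responsible for a job type."""
    try:
        host = hosts[job_type.lower()]
    except KeyError:
        raise LookupError(
            f"no worker registered for job type {job_type!r}") from None
    return f"http://{host}/worker/job"
```

For example, `worker_url("GMM")` resolves to the endpoint on the assumed `gmm-worker` host, while an unknown job type raises `LookupError` instead of silently dispatching nowhere.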

Worker

Worker = API + algorithm calculation

Responsibilities:

  • accept incoming computation requests
  • fetch files from storage
  • perform computation with defined settings
  • save results to files
  • return the location of the results, the success status, and the total duration
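
The worker responsibilities listed above form a short pipeline, sketched below under assumptions: `run_job`, `fetch`, `compute` and the JSON output format are hypothetical, chosen only to keep the example self-contained.

```python
import json
import time
from pathlib import Path
from typing import Any, Callable, Dict


def run_job(fetch: Callable[[], Any],
            compute: Callable[[Any], Any],
            output_path: Path) -> Dict[str, Any]:
    """Fetch input, compute, save the result, and report where it went."""
    started = time.monotonic()
    data = fetch()                               # fetch files from storage
    result = compute(data)                       # perform the computation
    output_path.write_text(json.dumps(result))   # save results to a file
    return {
        "success": True,
        "output_file": str(output_path),
        "duration_seconds": time.monotonic() - started,
    }
```

Passing `fetch` and `compute` as callables keeps the same skeleton usable for GMM, DiviK or any other algorithm worker.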

Architecture Option B - DISCARDED

                    Web                 Worker
                  /     \             /
web traffic -> LB   ...   DB - Master   ...
                  \     /             \
                    Web                 Worker