Commit 45a5e4c

committed
New "working with data" tutorial and improve getting started guide
1 parent 2f2ebae commit 45a5e4c

File tree

4 files changed

+236
-6
lines changed

assets/docs/data_tutorial.md

+168
@@ -0,0 +1,168 @@
1+
---
2+
title: Working with data
3+
toc: true
4+
toc-title: Table of Contents
5+
---
6+
## Overview
7+
LocusZoom.js aims to provide reusable and highly customizable visualizations. Towards this goal, a separation of concerns is enforced between data adapters (data) and data layers (presentation).
8+
9+
## Your first plot: defining how to retrieve data
10+
All data retrieval is performed by *adapters*: special objects whose job is to fetch the information required to render a plot. A major strength of LocusZoom.js is that it can connect several kinds of annotation from different places into a single view. The act of organizing data requests together is managed by an object called `LocusZoom.DataSources`. Below is an example creating a "classic" LocusZoom plot, in which GWAS, LD, and recombination rate are overlaid on a scatter plot, with genes and gnomAD constraint information on another track below. In total, five API endpoints are used to create this plot: four standard datasets, and one user-provided summary statistics file.
11+
12+
```javascript
13+
const apiBase = 'https://portaldev.sph.umich.edu/api/v1/';
14+
const data_sources = new LocusZoom.DataSources()
15+
.add('assoc', ['AssociationLZ', {url: apiBase + 'statistic/single/', params: { source: 45, id_field: 'variant' }}])
16+
.add('ld', ['LDServer', { url: 'https://portaldev.sph.umich.edu/ld/', source: '1000G', population: 'ALL', build: 'GRCh37' }])
17+
.add('recomb', ['RecombLZ', { url: apiBase + 'annotation/recomb/results/', build: 'GRCh37' }])
18+
.add('gene', ['GeneLZ', { url: apiBase + 'annotation/genes/', build: 'GRCh37' }])
19+
.add('constraint', ['GeneConstraintLZ', { url: 'https://gnomad.broadinstitute.org/api/', build: 'GRCh37' }]);
20+
```
21+
22+
Of course, defining datasets is only half the problem; see the [Getting Started Guide](index.html) for how to define rendering instructions (layout) and combine these pieces together to create the LocusZoom plot.
23+
24+
### Understanding the example
25+
In the example above, a new data source is added via a line of code such as the following:
26+
27+
```javascript
28+
data_sources.add('assoc', ['AssociationLZ', {url: apiBase + 'statistic/single/', params: { source: 45, id_field: 'variant' }}]);
29+
```
30+
31+
A lot is going on in this line!
32+
33+
* `data_sources.add` defines a piece of information that *could* be used by the plot. (If no layout asks for data from this source, then no API request will ever be made.)
34+
* The first argument to the function is a *namespace name*. It is an arbitrary reference to this particular piece of data. For example, you might want to plot three association studies together in the same window, and they could be defined as `.add('mystudy', ...)`, `.add('somephenotype', ...)`, `.add('founditinthegwascatalog', ...)`
35+
* In the [layouts guide](layout_tutorial.html), we will see how `data_layer.fields` uses these namespaces to identify what data to render.
36+
* The second argument to the function is a list of values: the name of a [predefined adapter](../api/module-LocusZoom_Adapters.html) that defines how to retrieve this data, followed by an object of configuration options (like url and params) that control which data will be fetched. Each type of data has its own options; see the documentation for a guide to available choices.
37+
* You are not limited to the types of data retrieval built into LocusZoom.js. See "creating your own adapter" or the [guide to plugins](plugins_tutorial.md) for more information.
38+
39+
### What should the data look like?
40+
In theory, LocusZoom.js can display whatever data it is given: each layout specifies which fields should be used for the x and y axes.
41+
42+
In practice, it is much more convenient to use pre-existing layouts that solve a common problem well out of the box: the set of options needed to control point size, shape, color, and labels is rather verbose, and highly custom behaviors entail a degree of complexity that is not always beginner friendly. For basic LocusZoom.js visualizations, our default layouts assume that you use the field names and format conventions defined in the [UM PortalDev API docs](https://portaldev.sph.umich.edu/docs/api/v1/). This is the quickest way to get started.
43+
44+
Most users will only need to implement their own way of retrieving GWAS summary statistics; the other annotations are standard datasets and can be freely used from our public API. For complex plots (like annotations of new data), see our [example gallery](https://statgen.github.io/locuszoom).
45+
46+
## How data gets to the plot
47+
If you are building a custom tool for exploring data, it is common to show the same data in several ways (eg, a LocusZoom plot next to a table of results). The user will have a better experience if the two widgets are synchronized to always show the same data, which raises a question: which widget is responsible for making the API request?
48+
49+
In LocusZoom.js, the user is allowed to change the information shown via mouse interaction (drag or zoom to change region, change LD calculations by clicking a button, etc.). This means that LocusZoom must always be able to ask for the data it needs, and initiate a new API request if necessary: a *pull* model. This contrasts with static plotting tools, such as those available in R, which show whatever data they are given (a *push* approach).
50+
51+
The act of contacting an external API, and fetching the data needed, is coordinated by *Adapters*. It is possible to share data with other widgets on the page via event callbacks, so that those widgets retrieve the newest data whenever the plot is updated (see `subscribeToData` in the [guide to interactivity](interactivity_tutorial.html) for details).
52+
53+
### Not every web page requires an API
54+
LocusZoom.js is designed to work well with REST APIs, but you do not need to create an entire web server just to render a single interactive plot. As long as the inputs can be transformed into a recognized format, they should work with the plot.
55+
56+
Some examples of other data retrieval mechanisms used in the wild are:
57+
58+
* Loading the data from a static JSON file (this can be as simple as giving the URL of the JSON file, instead of the URL of an API server!). Many bioinformaticians are comfortable converting between text files, so this is a low-effort way to get started... but static files always return the same data, and they return all of it at once. This can be limiting for big datasets or "jump to region" style interactivity.
59+
* Fetching the data from a Tabix-indexed file in an Amazon S3 bucket (via the [lz-tabix-source](../api/module-ext_lz-tabix-source.html) plugin; you will need to write your own function that parses each line into the required data format). This is exactly how our chromatin coaccessibility demo works!
60+
* Loading the data into a "shared global store" like vuex for a reactive single-page application, and asking LocusZoom to query the store instead of contacting a REST API directly. This is relatively advanced, but it can be useful if many page widgets need to coordinate and share a lot of data.
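
As a rough illustration of the line-parsing function mentioned for Tabix sources above, the sketch below converts one tab-delimited row into a record object. The helper name `parseAssocLine`, the column order, and the output field names are all assumptions made for this example; adjust them to match your file and the fields your layout requests.

```javascript
// Hypothetical parser for one line of a tab-delimited summary statistics file.
// Assumed column order (not defined by LocusZoom): chrom, pos, ref, alt, -log10(p).
function parseAssocLine(line) {
    const [chrom, pos, ref, alt, logPvalue] = line.trim().split('\t');
    return {
        chromosome: chrom,
        position: Number(pos),
        ref_allele: ref,
        alt_allele: alt,
        log_pvalue: Number(logPvalue),
        // Marker string in the chrom:pos_ref/alt style
        variant: `${chrom}:${pos}_${ref}/${alt}`,
    };
}

const parsedRecord = parseAssocLine('10\t114758349\tC\tT\t52.3');
```

Keeping the parser as a small pure function makes it easy to test against a few sample lines before wiring it into a plugin.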
61+
62+
### Example: Loading data from static JSON files
63+
One way to make a LocusZoom plot quickly is to load the data for your region in a static file, formatted as JSON objects to look like the payload from our standard REST API. The key concept below is that instead of a server, the URL points to the static file. This demonstration is subject to the limits described above, but it can be a way to get started.
64+
65+
```javascript
66+
const data_sources = new LocusZoom.DataSources()
67+
.add("assoc", ["AssociationLZ", {url: "assoc_10_114550452-115067678.json", params: {source: null}}])
68+
.add("ld", ["LDLZ", { url: "ld_10_114550452-115067678.json" }])
69+
.add("gene", ["GeneLZ", { url: "genes_10_114550452-115067678.json" }])
70+
.add("recomb", ["RecombLZ", { url: "recomb_10_114550452-115067678.json" }])
71+
.add("constraint", ["GeneConstraintLZ", { url: "constraint_10_114550452-115067678.json" }]);
72+
```
73+
74+
### What if my data doesn't fit the expected format?
75+
The built-in adapters are designed to work with a specific set of known REST APIs and fetch data over the web, but we provide mechanisms to customize every aspect of the data retrieval process, including how to construct the query sent to the server and how to modify the fields returned. See the guidance on "custom adapters" below.
76+
77+
In general, the more closely your field names match those used in premade layouts, the easier it will be to get started with common tasks. Certain features require additional assumptions about field format, and these sorts of differences may cause behavioral (instead of cosmetic) issues. For example:
78+
79+
* In order to fetch LD information relative to a specific variant, the data in the summary statistics must provide the variant name as a single string that combines chromosome, position, reference, and alt allele, like `1:12_A/C`. Our built-in LD adapter tries to handle the common marker formats from various programs and normalize them into a format that the LD server will understand, but we cannot guess the ref and alt allele. Following the order of values and using a known format will ensure best results.
80+
* The JavaScript language is not able to accurately represent very small p-values (numbers smaller than ~ 5e-324), and will truncate them to 0, changing the meaning of your data. For this reason, we recommend sending your data to the web page already transformed to -log10(p-value) format; this is much less susceptible to problems with numerical underflow.
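
The underflow problem can be seen in a few lines. The sketch below first shows a very small p-value collapsing to 0 when parsed, then shows one possible workaround for cases where the p-value arrives as text. The helper `negLog10FromText` is hypothetical and not part of LocusZoom.js; transforming the values before they reach the browser remains the simpler option.

```javascript
// Values below ~5e-324 underflow to 0 when parsed as a JavaScript number.
const payload = JSON.parse('{"pvalue": 1e-400}');
const underflowed = payload.pvalue;           // 0
const naiveLogP = -Math.log10(underflowed);   // Infinity

// Hypothetical helper: compute -log10(p) from the text form of a p-value
// (e.g. "2.5e-350") without ever creating the tiny number itself.
function negLog10FromText(text) {
    const [mantissa, exponent = '0'] = text.toLowerCase().split('e');
    return -(Math.log10(Number(mantissa)) + Number(exponent));
}

const safeLogP = negLog10FromText('1e-400');  // 400
```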
81+
82+
# Creating your own custom adapter
83+
## Re-using code via subclasses
84+
Most custom sites will only need to change very small things to work with their data. For example, if your REST API uses the same payload format as the UM PortalDev API, but a different way of constructing queries, you can change just one function and define a new data adapter:
85+
86+
```javascript
87+
const AssociationLZ = LocusZoom.Adapters.get('AssociationLZ');
88+
class CustomAssociation extends AssociationLZ {
89+
getURL(state, chain, fields) {
90+
// The inputs to the function can be used to influence what query is constructed. Eg, the current view region is stored in `plot.state`.
91+
const {chr, start, end} = state;
92+
// Fetch the region of interest from a hypothetical REST API that uses query parameters to define the region query, for a given study URL `/gwas/<id>/`
93+
return `${this.url}/${this.params.study_id}/?chr=${encodeURIComponent(chr)}&start=${encodeURIComponent(start)}&end=${encodeURIComponent(end)}`;
94+
}
95+
}
96+
// A custom adapter should be added to the registry before using it
97+
LocusZoom.Adapters.add('CustomAssociation', CustomAssociation);
98+
99+
// From there, it can be used anywhere throughout LocusZoom, in the same way as any built-in adapter
100+
data_sources.add('mystudy', ['CustomAssociation', {url: 'https://data.example/gwas', params: { study_id: 42 }}]);
101+
```
102+
103+
In the above source, an HTTP GET request will be sent to the server every time that new data is requested. If further control is required (like sending a POST request with custom body), you may need to override additional methods such as [fetchRequest](../api/module-LocusZoom_Adapters-BaseAdapter.html#fetchRequest). See below for more information, then consult the detailed developer documentation for details.
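
When a query string is assembled by hand, it is easy to drop a separator or forget to encode a value. As one possible alternative, the sketch below builds the same kind of region query with the standard `URL` and `URLSearchParams` objects. The function name `buildRegionUrl` and the `/gwas/<id>/` URL layout are assumptions carried over from the hypothetical REST API in the example above.

```javascript
// Hypothetical helper: build a region query URL for a study-specific endpoint.
function buildRegionUrl(baseUrl, studyId, { chr, start, end }) {
    // Remove any trailing slashes so the path segments join cleanly
    const root = baseUrl.replace(/\/+$/, '');
    const url = new URL(`${root}/${encodeURIComponent(studyId)}/`);
    // URLSearchParams handles separators and percent-encoding of each value
    url.search = new URLSearchParams({ chr, start, end }).toString();
    return url.toString();
}

const regionUrl = buildRegionUrl(
    'https://data.example/gwas',
    42,
    { chr: '10', start: 114550452, end: 115067678 }
);
// 'https://data.example/gwas/42/?chr=10&start=114550452&end=115067678'
```

A helper like this could be called from inside an overridden `getURL` method, passing in `this.url`, the study identifier, and the current `state`.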
104+
105+
Common types of data retrieval that are most often customized:
106+
107+
* GWAS summary statistics
108+
* This fetches the data directly with minor cleanup. You can customize the built-in association adapter, or swap in another way of fetching the data (like tabix).
109+
* User-provided linkage disequilibrium (LD)
110+
* This contains special logic used to combine association data (from a previous request) with LD information. To ensure that the matching code works properly, we recommend matching the payload format of the public LDServer, but you can customize the `getURL` method to control where the data comes from.
111+
* PheWAS results
112+
113+
## The request lifecycle
114+
The adapter performs many functions related to data retrieval: constructing the query, caching to avoid unnecessary network traffic, and parsing the data into a transformed representation suitable for use in rendering.
115+
116+
Methods are provided to override all or part of the process, called in roughly the order below:
117+
118+
```javascript
119+
getData(state, fields, outnames, transformations)
120+
getRequest(state, chain, fields)
121+
getCacheKey(state, chain, fields)
122+
fetchRequest(state, chain, fields)
123+
getURL(state, chain, fields)
124+
parseResponse(resp, chain, fields, outnames, transformations)
125+
normalizeResponse(data)
126+
annotateData(data, chain)
127+
extractFields(data, fields, outnames, trans)
128+
combineChainBody(data, chain, fields, outnames)
129+
```
130+
131+
The parameters passed to getData are as follows:
132+
133+
* state - this is the current "state" of the plot. This contains information about the current region in view (`chr`, `start`, and `end`), which is often valuable in querying a remote data source for the data in a given region.
134+
* fields - this is an array of field names that have been requested from this data source, with the "namespace:" part of each name removed. Note: **most data adapters will return *only* the fields that are requested by a data layer**. Each data layer can request a different set of fields, and thus **different parts of the plot may have a different view of the same data.**
135+
* outnames - this is an array with length equal to fields with the original requested field name. This value contains the data source namespace. The data output for each field should be given the name in this array. This is rarely used directly.
136+
* transformations - this is an array with length equal to fields with the collection of value transformations functions specified by the user to be run on the returned field. This is rarely used directly.
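
To make the relationship between these parallel arrays concrete, the sketch below splits two requested field strings into their parts. It assumes a `namespace:field|transformation` request syntax and is only an illustration: the helper `describeRequestedField` is hypothetical, and the real adapter receives transformation *functions*, whereas this sketch keeps only their names.

```javascript
// Hypothetical helper: split a requested field string into the pieces that
// correspond to the `fields`, `outnames`, and `transformations` arguments.
function describeRequestedField(raw) {
    const [qualified, ...transformationNames] = raw.split('|');
    const sep = qualified.indexOf(':');
    return {
        namespace: sep === -1 ? '' : qualified.slice(0, sep),
        field: qualified.slice(sep + 1),   // name with the namespace removed
        outname: raw,                      // original requested name, as written
        transformationNames,               // names only, in this sketch
    };
}

const requested = ['assoc:variant', 'assoc:log_pvalue|scinotation']
    .map(describeRequestedField);

// Three arrays of equal length, one entry per requested field
const fieldNames = requested.map((r) => r.field);
const outNames = requested.map((r) => r.outname);
const transformNames = requested.map((r) => r.transformationNames);
```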
137+
138+
### Step 1: Fetching data from a remote server
139+
The first step of the process is to retrieve the data from an external location. `getRequest` is responsible for deciding whether the query can be satisfied by a previously cached request, and if not, sending the request to the server. At the conclusion of this step, we typically have a large unparsed string: eg REST APIs generally return JSON-formatted text, and tabix sources return lines of text for records in the region of interest.
140+
141+
Most custom data sources will focus on customizing two things:
142+
* getURL (how to ask the external source for data)
143+
* getCacheKey (decide whether the request can be satisfied by local data)
144+
* By default this returns a string based on the region in view: `${state.chr}_${state.start}_${state.end}`
145+
* You may need to customize this if your source has other inputs required to uniquely define the query (like LD reference variant, or calculation parameters for credible set annotation).
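
A minimal sketch of why the cache key sometimes needs extra inputs is shown below. The first function mirrors the default region-based key described above; the second adds an LD reference variant. Both function names and the `refVariant` argument are assumptions made for this illustration, written as standalone functions rather than as adapter methods.

```javascript
// Mirrors the default key described above: one cache entry per region in view.
function regionCacheKey(state) {
    return `${state.chr}_${state.start}_${state.end}`;
}

// Hypothetical variation: LD results depend on the reference variant as well,
// so that value is made part of the key.
function ldCacheKey(state, refVariant) {
    return `${regionCacheKey(state)}_${refVariant}`;
}

const viewState = { chr: '10', start: 114550452, end: 115067678 };
const keyA = ldCacheKey(viewState, '10:114758349_C/T');
const keyB = ldCacheKey(viewState, '10:114754071_T/C');
// Same region, different reference variant: the keys differ, so a cached
// response for one variant would not be reused for the other.
```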
146+
147+
### Step 2: Formatting and parsing the data
148+
The `parseResponse` sequence handles the job of parsing the data. It can be used to convert many different API formats into a single standard form. There are four steps to the process:
149+
150+
* `normalizeResponse` - Converts any data source response into a standard format. This can be used when you want to take advantage of existing data transformation functionality of a particular adapter (like performing an interesting calculation), but your data comes from something like a tabix file that needs to be transformed first.
151+
* Internally, most data layer rendering types assume that data is an array, with each data element represented by an object: `[{a_field: 1, other_field: 1}]`
152+
* Some sources, such as the UM PortalDev API, represent the data in a column-oriented format instead. (`{a_field: [1], other_field: [1]}`) The default adapter will attempt to detect this and transform those columns into the row-based one-record-per-datum format.
153+
* `annotateData` - This can be used to add custom calculated fields to the data. For example, if your data source does not provide a variant marker field, one can be generated in javascript (by concatenating chromosome:position_ref/alt), without having to modify the web server.
154+
* `extractFields` - Each data layer receives only the fields it asks for, and the data is reformatted in a way that clearly identifies where they come from (the namespace is prefixed onto each field name, eg `{'mynamespace:a_field': 1}`).
155+
* The most common reason to override this method is if the data uses an extremely complex payload format (like genes), and a custom data layer expects to receive that entire structure as-is. If you are working with layouts, the most common sign of an adapter that does this is that the data layer asks for a nonexistent field (`genes:all`: a synthetic field because data is only retrieved if at least one field is used)
156+
* `combineChainBody`: If a single data layer asks for data from more than one source, this function is responsible for combining several different pieces of information together. For example, in order to show an association plot with points colored by LD, the LD adapter implements custom code that annotates the association data with matching LD information. At the end of this function, the data layer will receive a single combined record per visualized data element.
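
The first three steps can be sketched as plain functions, independent of any adapter class. The code below is an illustration under stated assumptions rather than the library's implementation: the function names are hypothetical stand-ins for `normalizeResponse`, `annotateData`, and `extractFields`, and the input field names are invented for the example.

```javascript
// Column-oriented payload -> one object per record (the "normalize" idea)
function columnsToRows(columns) {
    const names = Object.keys(columns);
    const length = names.length ? columns[names[0]].length : 0;
    return Array.from({ length }, (_, i) =>
        Object.fromEntries(names.map((name) => [name, columns[name][i]])));
}

// Add a calculated marker field to each record (the "annotate" idea)
function addMarker(rows) {
    return rows.map((row) => ({
        ...row,
        variant: `${row.chromosome}:${row.position}_${row.ref_allele}/${row.alt_allele}`,
    }));
}

// Keep only the requested fields, prefixed with the namespace (the "extract" idea)
function prefixFields(rows, namespace, fields) {
    return rows.map((row) =>
        Object.fromEntries(fields.map((f) => [`${namespace}:${f}`, row[f]])));
}

const columnPayload = {
    chromosome: ['1', '1'],
    position: [12, 34],
    ref_allele: ['A', 'G'],
    alt_allele: ['C', 'T'],
    log_pvalue: [3.2, 8.1],
};
const layerData = prefixFields(
    addMarker(columnsToRows(columnPayload)),
    'assoc',
    ['variant', 'log_pvalue']
);
// First record: { 'assoc:variant': '1:12_A/C', 'assoc:log_pvalue': 3.2 }
```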
157+
158+
#### Working with the data "chain"
159+
Each data layer is able to request data from multiple different sources. Internally, this process is referred to as the "chain" of linked data requested. LocusZoom.js assumes that every data layer is independent and decoupled: it follows that each data layer has its own chain of requests and its own parsing process.
160+
161+
This chain defines how to share information between different adapters. It consists of two key pieces:
162+
* `body`: the actual consolidated payload. Each subsequent link in the chain receives all the data from the previous step as `chain.body`
163+
* `headers`: this is a "meta" section used to store information used during the consolidation process. For example, the LD adapter needs to find the most significant variant from the previous step in the chain (association data) in order to query the API for LD. The name of that variant can be stored for subsequent use during the data retrieval process.
164+
165+
Only `chain.body` is sent to the data layer. All other parts of the chain are discarded at the end of the data retrieval process.
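
The flow of a chain through two steps can be sketched with plain objects. Everything here is a simplified illustration: the function names, the `ldrefvar` header key, the `ld_rsquare` field, and the lookup table standing in for an LD server response are assumptions, not the library's internal API.

```javascript
// Step 1 (association): the records become the chain body.
function assocStep(chain, assocRows) {
    return { headers: { ...chain.headers }, body: assocRows };
}

// Step 2 (LD): read the previous body, record the reference variant in the
// headers, and annotate each record with a matching LD value.
function ldStep(chain, ldByVariant) {
    const top = chain.body.reduce(
        (best, row) => (row.log_pvalue > best.log_pvalue ? row : best));
    const headers = { ...chain.headers, ldrefvar: top.variant };
    const body = chain.body.map((row) => ({
        ...row,
        ld_rsquare: row.variant in ldByVariant ? ldByVariant[row.variant] : null,
    }));
    return { headers, body };
}

const assocRows = [
    { variant: '1:12_A/C', log_pvalue: 3.2 },
    { variant: '1:34_G/T', log_pvalue: 8.1 },
];
// Stand-in for an LD response computed relative to the top variant
const ldLookup = { '1:12_A/C': 0.4, '1:34_G/T': 1.0 };

let chain = { headers: {}, body: [] };
chain = assocStep(chain, assocRows);
chain = ldStep(chain, ldLookup);

// Only the body would reach the data layer; the headers are discarded.
const dataForLayer = chain.body;
```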
166+
167+
## See also
168+
LocusZoom.js is able to share its data with external widgets on the page, via event listeners that allow those widgets to update whenever the user interacts with the plot (eg panning or zooming to change the region in view). See `subscribeToData` in the [guide to interactivity](interactivity_tutorial.html) for more information.
