SynDiffix NYC Taxi Example

Paul Francis edited this page Sep 19, 2023 · 4 revisions

Several views of the NYC taxi data

Each year, New York City publishes data for every taxi ride. There is one row per ride. The columns include start and end time, start and end location, fare, number of passengers, driver number, and taxi number, among others.

We synthesized one day of this data using SynDiffix, mostly.ai, and CTGAN. We synthesized three columns: trip start hour, trip start latitude, and trip start longitude. There were 1,048,575 rides in total. As SynDiffix is currently designed, it automatically generates the same number of rides as the original data plus noise; in this case the noise was -2, giving 1,048,573 rides. We requested that CTGAN and mostly.ai generate the same number of rides. (Initially mostly.ai only produced 1M rows of data, so we requested an additional 48,575 rides.)

The following displays a large portion of NYC as a heatmap of the number of ride pickup locations for a single hour. Heatmap boxes are excluded when the number of rides is less than 5.
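The counting and suppression behind such a heatmap can be sketched roughly as follows. This is a minimal illustration, not the code used to produce the figures: the function name `heatmap_counts`, the `box_size` of 0.01 degrees, and the sample points are all assumptions made for the example. Only the threshold of 5 rides comes from the text above.

```python
import math
from collections import Counter


def heatmap_counts(points, box_size=0.01, min_count=5):
    """Count pickups per grid box and drop boxes with fewer than min_count rides.

    points    -- iterable of (latitude, longitude) pairs
    box_size  -- grid box edge in degrees (hypothetical value, not from the source)
    min_count -- boxes with fewer rides than this are excluded
    """
    counts = Counter(
        (math.floor(lat / box_size), math.floor(lon / box_size))
        for lat, lon in points
    )
    return {box: n for box, n in counts.items() if n >= min_count}


# Made-up pickup points: six fall in one box, two in another.
sample_points = [(40.7751 + i * 0.0001, -73.8749) for i in range(6)]
sample_points += [(40.7651, -73.8649), (40.7652, -73.8648)]

boxes = heatmap_counts(sample_points)
```

With these sample points, only the box holding six rides survives; the box with two rides is left out of the heatmap.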

taxi-zoom-out

Visually the SynDiffix data is very close to the original data. A few boxes are missing or added, and the intensity is slightly different here and there, but overall SynDiffix is quite accurate. Mostly.ai and CTGAN both show a higher density of rides around midtown Manhattan, but overall the mostly.ai rides are quite spread out compared to the original data, and the CTGAN rides are extremely spread out.

The following is a scatterplot of rides originating around La Guardia airport for the same hour.

taxi-lg-points

Visually, SynDiffix matches the original data most closely. SynDiffix tends to compress and hide data points, while mostly.ai tends to spread them out. Put another way, SynDiffix protects sparse data by hiding it, while mostly.ai protects sparse data by adding more data. CTGAN also spreads the data, but to such an extreme that the result is useless as a scatterplot.

The following displays the same data as a heatmap, showing the number of rides in each grid box.

taxi-lg-heat

Visually the SynDiffix heatmap is very close to the original heatmap. One low-count box has been dropped, and another added, but otherwise the boxes match. The counts are also very similar. All of the counts are within 10 of each other, and most of the counts are within 5.

The spreading effect of mostly.ai is again visually noticeable here. Its counts are also less accurate than those of SynDiffix by roughly an order of magnitude: some counts are off by more than 100, and many by more than 50.
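A comparison of this kind could be computed along the following lines. This is only a sketch under stated assumptions: each heatmap is represented as a dictionary mapping a grid box to its ride count, the function name `compare_heatmaps` is hypothetical, and the numbers in the usage example are invented for illustration rather than taken from the figures.

```python
def compare_heatmaps(original, synthetic, tolerance=5):
    """Compare per-box ride counts of an original and a synthetic heatmap.

    Both arguments map a grid box (any hashable key) to a ride count.
    Count errors are measured only on boxes present in both heatmaps;
    boxes present in just one are reported as dropped or added.
    """
    shared = set(original) & set(synthetic)
    errors = {box: abs(original[box] - synthetic[box]) for box in shared}
    return {
        "dropped": sorted(set(original) - set(synthetic)),
        "added": sorted(set(synthetic) - set(original)),
        "max_error": max(errors.values(), default=0),
        "within_tolerance": sum(e <= tolerance for e in errors.values()),
        "shared_boxes": len(shared),
    }


# Invented counts, for illustration only.
original_counts = {(0, 0): 120, (0, 1): 48, (1, 0): 7}
synthetic_counts = {(0, 0): 117, (0, 1): 56, (1, 1): 6}

report = compare_heatmaps(original_counts, synthetic_counts)
```

Here `report` lists one dropped box, one added box, and the largest count difference among the two shared boxes, which mirrors the kind of statements made about the heatmaps above.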