Fueling Trino Large-scale Geospatial Analysis with the Varada Trino Connector

By Guy Mast
February 17, 2022

Improve performance by up to 42X with Varada, which automatically accelerates queries with indexing and smart cache management

In our last article, we discussed two methods for running geospatial analysis with Trino and the Hive connector and explored a few optimization techniques to improve performance/cost. If you recall, we were able to improve performance by a factor of 1.7-3.9x for our queries, but still ended up with ~1-2 minute response times when querying a 4.4B-row dataset on a 4-node cluster.

In this post, we’ll explore how the Varada connector can significantly improve the cost/performance ratio, and see how the Community Edition of the Varada Trino Connector can instantly accelerate Trino geospatial queries and improve their interactivity, enabling large-scale analysis.

What’s missing for enabling geospatial analysis at scale

Geospatial analytics on big data is particularly challenging because geospatial operations such as spatial joins, nearest-neighbor searches, or point-in-polygon lookups are inherently highly selective and computationally expensive unless an auxiliary data structure maps geo-locations to the underlying rows. Many Trino users want to leverage their existing investment in Trino to run spatial analysis at scale, but face challenges in balancing cost/performance with business needs.
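To make the problem concrete, here is the shape of a naive distance join (essentially Query #1 from the benchmark at the end of this post): with no auxiliary structure, the engine has to scan every trip row from the lake and evaluate the distance function against every place.

select
  count(distinct rider_id)
from
  hive.ride_sharing_dataset.trips_data_bing as t,
  hive.ride_sharing_dataset.places as p
where
  -- nothing can prune this scan: the predicate is a function of both tables
  great_circle_distance(t.lat, t.lon, p.lat, p.lon) <= 0.5;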

A common question on the Trino community forum and Slack channel is whether partitioning or indexing could enable faster performance of geospatial data analysis. Here are a few examples:

Question #1

“I’m looking for a solution to query massive point cloud stored in iceberg with Trino. While doing some tests, I was wondering if there is a way to improve geospatial queries in Trino? something like, building and storing a KDTree, pruning unwanted files by using a geohash or something ?”

Question #2

“Hello, I’m evaluating Trino (very green just reading docs to see if it fits our use case) and have some questions about its spatial capabilities. Does it use spatial indexes for joins between different systems? E.g. When joining points in Hive to polygons elsewhere, will it create spatial indexes in presto itself for the join?”

Question #3

“Hi all, I am curious to hear about the community’s experiences with large spatial joins and their performance. We’re evaluating Trino for ad-hoc analytics on our data in AWS (in parquet).”

Answer:

In the event that you are dealing with a very large geo data set, partitioning your events into smaller time intervals will allow the event data to be broadcast. You can also partition the geo data into regions, and run a larger number of smaller queries against individual or subsets of regions.

Challenges Partitioning Data by Region

Suggested solutions usually involve partitioning the data by region, or adding some form of index inside Trino.

It’s worth mentioning that partitioning by a high-cardinality column such as region might help a bit, but it also surfaces new challenges, such as a long-tailed data distribution across partitions and extremely small files per partition, which can significantly degrade query performance.
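For illustration, here is a minimal sketch of what such a region-partitioned table could look like with Trino’s Hive connector (the table name and columns are hypothetical):

-- Hypothetical region-partitioned trips table on the data lake.
-- With a high-cardinality partition key, many partitions end up holding
-- few rows each, producing lots of small files that slow down scans.
create table hive.ride_sharing_dataset.trips_by_region (
  rider_id bigint,
  lat double,
  lon double,
  region varchar -- partition columns must come last
)
with (
  format = 'PARQUET',
  partitioned_by = ARRAY['region']
);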

Indexing the data using a k-d tree or a grid system (such as Bing tiles) is a common approach for dealing with large-scale geospatial workloads. Grids approximate geo features such as polygons and multi-polygons with a fixed set of identifiable cells, reducing the number of geospatial operations and cells scanned at the cost of accuracy and granularity.
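As a quick illustration of the grid approach, Trino’s built-in Bing tile functions map a latitude/longitude pair to a quadkey string that can be stored alongside each row and used as a cheap equality pre-filter (the coordinates below are arbitrary):

-- Quadkey of the tile containing a point at zoom level 19
-- (cell sizes are on the order of tens of meters at this zoom).
select bing_tile_quadkey(bing_tile_at(37.7218, -122.4560, 19));

-- Quadkeys of all tiles within 0.5 km of a point,
-- the same pattern used by Query #3 below.
select transform(bing_tiles_around(37.7218, -122.4560, 19, 0.5),
                 t -> bing_tile_quadkey(t));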

It’s clear that Trino is missing a key component to enable large-scale geospatial analysis on the data lake that meets business and cost/performance requirements.

Free dynamic geospatial indexing with Varada’s Trino Connector

The Varada Trino Connector automatically accelerates queries with dynamic indexing and smart cache management. It’s completely plug-and-play, just start running your queries and the connector will create indexes and help you deliver the best possible performance and concurrency for geospatial queries. 

The connector is easy to install and integrates seamlessly with existing Trino clusters deployed on AWS. It’s still your Trino: no additional ETLs, no query rewrites, no added footprint in your existing architecture, and no changes to how you manage clusters. It’s just autonomously fast.

Running the same experiment as before (detailed setup, queries, and results at the end of the article), now using the Community Edition of Varada’s Trino Connector, we were able to improve performance by a factor of 7.5-42X!

With Varada, you can run geospatial analysis at scale while boosting cost/performance, future-proofing SLAs, and growing the volume and dimensionality of the analyzed data. The cool part: you can start doing it today, for free, with the Varada Connector Community Edition!

Mini Benchmark

Setup

4 x i3.4xlarge machines on AWS were used to execute the queries with both the Trino Hive and Trino Varada connectors.

Results

| Use Case | Query | Hive (s) | Varada (s) | Varada Improvement vs. Hive |
|---|---|---|---|---|
| Use case #1: distance from points of interest (100m) | Query #3.1 | 119 | 9 | 13.2X |
| Use case #2: distance from points of interest (500m) | Query #3 | 127 | 17 | 7.5X |
| Use case #3: lookups in a geofence | Query #4 | 106 | 1.6 | 41.7X |

Queries

Query #1 (500m)

select
  count(distinct rider_id)
from
  hive.ride_sharing_dataset.trips_data_bing as t,
  hive.ride_sharing_dataset.places as p
where
  great_circle_distance(t.lat, t.lon, p.lat, p.lon) <= 0.5;

Query #1.1 (100m)

select
  count(distinct rider_id)
from
  hive.ride_sharing_dataset.trips_data_bing as t,
  hive.ride_sharing_dataset.places as p
where
  great_circle_distance(t.lat, t.lon, p.lat, p.lon) <= 0.1;

Query #2

select
  count(distinct driver_id)
from
  ride_sharing_dataset.trips_data_bing t
where
  ST_Contains(
    ST_Polygon('POLYGON ((-122.45635986328125 37.72130604487681, -122.45567321777344 37.72130604487681, -122.45567321777344 37.72184917678752, -122.45635986328125 37.72184917678752, -122.45635986328125 37.72130604487681))'),
    st_point(lon, lat));

Query #2.1 (adding bounding box with lat/lon)

select
  count(distinct driver_id)
from
  ride_sharing_dataset.trips_data_bing t
where
  lon between -122.45635986328125 and -122.45567321777344
  and lat between 37.72130604487681 and 37.72184917678752
  and ST_Contains(
    ST_Polygon('POLYGON ((-122.45635986328125 37.72130604487681, -122.45567321777344 37.72130604487681, -122.45567321777344 37.72184917678752, -122.45635986328125 37.72184917678752, -122.45635986328125 37.72130604487681))'),
    st_point(lon, lat));

Query #3

with p as (
  -- expand each place into the quadkeys of all tiles within 0.5 km
  select lat, lon, qkey
  from
    ride_sharing_dataset.places
    cross join unnest(transform(bing_tiles_around(lat, lon, 19, 0.5), t -> bing_tile_quadkey(t))) as t(qkey)
)
-- use that list to filter matching tiles
select
  count(distinct rider_id)
from
  ride_sharing_dataset.trips_data_bing as t,
  p
where
  t.qkey = p.qkey -- match the bounding tile
  and great_circle_distance(t.lat, t.lon, p.lat, p.lon) <= 0.5;

Query #4

with q as (
  -- quadkeys of all tiles covering the geofence polygon
  select qkey
  from
    (values ('POLYGON ((-122.45635986328125 37.72130604487681, -122.45567321777344 37.72130604487681, -122.45567321777344 37.72184917678752, -122.45635986328125 37.72184917678752, -122.45635986328125 37.72130604487681))')) as a(p)
    cross join unnest(transform(geometry_to_bing_tiles(ST_Polygon(a.p), 19), t -> bing_tile_quadkey(t))) as t(qkey)
)
select
  count(distinct driver_id)
from
  ride_sharing_dataset.trips_data_bing t,
  q
where
  q.qkey = t.qkey -- match the bounding tile
  and ST_Contains(
    ST_Polygon('POLYGON ((-122.45635986328125 37.72130604487681, -122.45567321777344 37.72130604487681, -122.45567321777344 37.72184917678752, -122.45635986328125 37.72184917678752, -122.45635986328125 37.72130604487681))'),
    st_point(lon, lat));
