R-Tree for Spatially Joining Multi-polygon Data

My dataset visualized in LBNL’s CityBES software. Can you identify some famous NYC landmarks here?

R-tree illustration in a 3D Euclidian space

During my research internship at Lawrence Berkeley National Laboratory (LBNL) supervised my Dr. Tianzhen Hong (Head of Urban Systems Group), I came across with an interesting task–to match and join building records based on their geographical footprints (the multi-polygons marked by GPS coordinates). I had never pondered this question though–as I always tended to think that for database systems, matching or joining has to be done in an “exact” way, in which you are joining data rows base on some exact column values: for example, you want to perform a joining of two tables based on the student id–a record that base to be matched exactly for two rows be joined together.

For geo-data matching, this is not the case. We are rather joining data on an abstract level: we treat two building polygons as belong to the same building if they intersect sufficiently, or we treat two building records (one with polygon and one with a single GPS point) as the same building, if the GPS point is included in the polygon. That, exactly, is what R-trees are good with: figuring out whether two shapes overlap, intersect or include one another.

2D R-tree with comparative graph and tree structures

From Wikipedia: The key idea of the data structure is to group nearby objects and represent them with their minimum bounding rectangle in the next higher level of the tree; the “R” in R-tree is for rectangle. Since all objects lie within this bounding rectangle, a query that does not intersect the bounding rectangle also cannot intersect any of the contained objects. At the leaf level, each rectangle describes a single object; at higher levels the aggregation of an increasing number of objects. This can also be seen as an increasingly coarse approximation of the data set.

Therefore, although building an R-tree might be computationally expensive, but querying it will be fast, thanks to the tree structure. Using this knowledge, I implemented a 2-dimensional R-tree on building records in Fresno City, New York City, and the LA County, and was able to fast join building records (between CityGML building geometric data, land parcel Assessor’s records, and real-estate data queried from Zillow and ATTOM Data Solutions). Shown in Figure 1 is what one of my final products looks like: the building records in lower Manhattan including 3d building geometric shapes, building types, energy consumptions, etc.