ODL – Optimal Data Layout

Introduction

This website documents and demonstrates the accomplishments of the NSF EarthCube project “Optimal Data Layout for Scalable Geophysical Analysis in a Data-Intensive Environment” (Optimal Data Layout for short, or simply ODL), a collaboration between PI Prof. Hongfeng Yu of the University of Nebraska-Lincoln (UNL) and Co-PI Dr. Kwo-Sen Kuo of the University of Maryland-College Park (UMD). ODL aims to identify the optimal data layout strategy for scalable, high-performance, data-intensive geoscience analyses, a key part of the solution to the geoscience Big Data challenges.

Based on insights gained from examining the bandwidths of the data movement links connecting compute and storage, we conclude that a solution that optimizes Value for geoscience Big Data analysis should have the following characteristics:

  • Data volume to be moved must be commensurate with connection bandwidth to avoid network congestion;
  • Data placements should be spatiotemporally aligned to minimize unnecessary data movements; and
  • Compute and storage should be tightly coupled to take full advantage of spatiotemporal data placement alignment (DPA).
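The idea behind data placement alignment can be illustrated with a deliberately simplified sketch. It is not the project's indexing scheme: the coarse latitude/longitude/time cell, the function names `placement_key` and `node_for`, and the cell sizes are all hypothetical, chosen only to show how samples from different datasets that are close in space and time can be routed to the same node.

```python
import math

def placement_key(lat, lon, hour, cell_deg=10.0, hours_per_block=24):
    """Map a (lat, lon, hour) sample to a coarse spatiotemporal cell.

    Hypothetical scheme: fixed 10-degree cells and 24-hour blocks.
    """
    row = int(math.floor((lat + 90.0) / cell_deg))
    col = int(math.floor((lon + 180.0) / cell_deg))
    block = hour // hours_per_block
    return (row, col, block)

def node_for(key, n_nodes=16):
    """Deterministically assign a cell to one of n_nodes cluster nodes."""
    row, col, block = key
    # Simple integer mixing; any deterministic function of the key would do.
    return ((row * 73856093) ^ (col * 19349663) ^ (block * 83492791)) % n_nodes

# A grid sample and a swath sample that are near each other in space and time
grid_sample = placement_key(40.0, -96.7, hour=5)
swath_sample = placement_key(40.3, -96.2, hour=7)
same_node = node_for(grid_sample) == node_for(swath_sample)
```

Because both samples fall in the same cell, they map to the same node, so a joint analysis of the two would not need to move either one across the network.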

Our Solution

Datasets in Brief

A browser Map Interface has been constructed to conduct interactive analyses of diverse datasets through visual/graphical interactions, demonstrating succinctly the accomplishments of our Optimal Data Layout (ODL) project. The datasets accessible by the interface for analysis are summarized in the following table. Further details of these datasets can be found on the Data page.

Dataset             Spatial resolution  Spatial coverage  Temporal resolution  Temporal duration  Remarks
MERRA-2 (PRECTOT)   0.625°×0.5°         Global            1 hr                 3 mon              Grid
MERRA-2 (Blizzard)  0.625°×0.5°         Global            1 hr                 3 mon              Grid
NMQ                 0.01°×0.01°         CONUS             5 min                3 mon              Grid
TRMM                4~5 km              Tropics           ~                    3 mon              Swath

The Map Interface

ODL’s Map Interface for visual analytics demonstration is based on the Google Maps interface. Its components and functions are generally intuitive. Even so, the brief instructions on the Map Interface User Guide page should make for a smoother experience.

The interface is supported by a backend cluster of 16 lightweight nodes (one 8-core CPU with 32 GB DRAM per node) running SciDB (release 16.9). Due to the cluster’s middling capacity and capability, the maximum number of concurrent users of the interface is limited to 10.

Any combination of the four datasets listed above can be displayed in synchronized hourly animation. Data for the animation are fetched in real time from SciDB. Because STARE (SpatioTemporal Adaptive-Resolution Encoding) is used to ensure spatiotemporal DPA, unnecessary communication among the nodes is mostly eliminated, which keeps performance reasonable [1]. However, a user’s experience still depends on the effective bandwidth available. In addition, since data must be moved to the client side for rendering, the animation slows down as the volume of data to be moved increases.
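A rough back-of-the-envelope calculation shows why animation speed depends on data volume. The sketch below assumes uncompressed 4-byte values, a full grid transferred per frame, a hypothetical 100 Mbit/s effective link, and a hypothetical 70°×35° extent for the CONUS grid; none of these figures come from the project, and the actual interface may transfer far less per frame.

```python
def grid_points(lon_res_deg, lat_res_deg, lon_span_deg=360.0, lat_span_deg=180.0):
    """Approximate number of grid cells for a regular lat/lon grid."""
    return round(lon_span_deg / lon_res_deg) * round(lat_span_deg / lat_res_deg)

def frames_per_second(bandwidth_mbit_s, points, bytes_per_value=4):
    """Upper bound on frame rate if every frame ships the whole grid."""
    frame_bits = points * bytes_per_value * 8
    return bandwidth_mbit_s * 1e6 / frame_bits

# Global 0.625 x 0.5 degree grid (MERRA-2 resolution from the table)
merra_points = grid_points(0.625, 0.5)
# 0.01 x 0.01 degree grid over an ASSUMED 70 x 35 degree CONUS extent
nmq_points = grid_points(0.01, 0.01, lon_span_deg=70.0, lat_span_deg=35.0)

link_mbit_s = 100  # hypothetical effective bandwidth
merra_fps = frames_per_second(link_mbit_s, merra_points)
nmq_fps = frames_per_second(link_mbit_s, nmq_points)
```

Under these assumptions the finer grid has roughly two orders of magnitude more points per frame, so its achievable frame rate over the same link is correspondingly lower.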

Rudimentary analyses, such as time series and percentiles, are available for demonstration. Re-gridding is not yet supported through the interface.
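As an illustration of what a time series and a percentile analysis compute, here is a minimal plain-Python sketch. It is a stand-in for explanation only, not the SciDB queries behind the interface; the function names and the linear-interpolation percentile definition are assumptions.

```python
import statistics

def area_mean_series(frames):
    """Return the spatial mean of each time step.

    frames: list of 2D lists, one grid per time step.
    """
    return [statistics.fmean(v for row in frame for v in row) for frame in frames]

def percentile(values, p):
    """Percentile with linear interpolation between ranks, p in [0, 100]."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of an empty sequence is undefined")
    k = (len(ordered) - 1) * p / 100.0
    lo = int(k)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (k - lo)

# Two tiny 2x2 "precipitation" grids at consecutive time steps
frames = [[[0.0, 2.0], [4.0, 6.0]],
          [[1.0, 1.0], [1.0, 1.0]]]
series = area_mean_series(frames)
median_of_series = percentile(series, 50)
```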

Significance and Implications

The capability and performance demonstrated in ODL have far-reaching significance and implications. The connected component labeling (CCL) capability alone will greatly enrich climatological statistics. Beyond identifying the presence of a phenomenon in data, CCL enables the tracking of events to establish them as individual episodes (much like the tracking of hurricanes), from which event-based statistics can be obtained, such as “the average number of blizzards per year globally” (as a very basic example). The Blizzard page presents a few more sophisticated examples. We badly need these highly contextual statistics to better diagnose our models by targeting specific processes and thus improve them. However, the conventional means of obtaining such statistics are exceedingly (but needlessly) complex, time-consuming, and labor-intensive. With the technology innovations and integration represented in ODL, the process becomes simpler, more streamlined, and more conducive to automation.
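To make the role of CCL concrete, the following is a minimal sketch of labeling events in a small (time, y, x) mask and deriving an event-based statistic. It is a generic breadth-first labeling with 6-connectivity, not the project's distributed implementation; it ignores longitude wraparound, and the names `label_events` and `event_durations` are hypothetical.

```python
from collections import deque

def label_events(mask):
    """Label connected regions in mask[t][y][x]; return (labels, n_events)."""
    nt, ny, nx = len(mask), len(mask[0]), len(mask[0][0])
    labels = [[[0] * nx for _ in range(ny)] for _ in range(nt)]
    # Neighbors differ by one step in time, y, or x (6-connectivity)
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    n_events = 0
    for t in range(nt):
        for y in range(ny):
            for x in range(nx):
                if not mask[t][y][x] or labels[t][y][x]:
                    continue
                n_events += 1
                labels[t][y][x] = n_events
                queue = deque([(t, y, x)])
                while queue:
                    ct, cy, cx = queue.popleft()
                    for dt, dy, dx in steps:
                        a, b, c = ct + dt, cy + dy, cx + dx
                        if (0 <= a < nt and 0 <= b < ny and 0 <= c < nx
                                and mask[a][b][c] and not labels[a][b][c]):
                            labels[a][b][c] = n_events
                            queue.append((a, b, c))
    return labels, n_events

def event_durations(labels):
    """Return {event label: number of time steps the event spans}."""
    spans = {}
    for t, frame in enumerate(labels):
        for row in frame:
            for lab in row:
                if lab:
                    first, last = spans.get(lab, (t, t))
                    spans[lab] = (min(first, t), max(last, t))
    return {lab: last - first + 1 for lab, (first, last) in spans.items()}

# Three time steps on a 2x3 grid; 1 marks cells where the phenomenon is present
mask = [
    [[1, 0, 0], [0, 0, 0]],
    [[1, 0, 0], [0, 0, 1]],
    [[0, 0, 0], [0, 0, 0]],
]
labels, n_events = label_events(mask)
durations = event_durations(labels)
```

Here the cell that persists across the first two time steps becomes one event, and the isolated cell at the second time step becomes another, so statistics such as event counts and durations follow directly from the labels.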

In addition, the ODL concept offers several more significant advantages:

  • SciDB is a multiuser system with sophisticated user, role, and namespace management, making it a natural platform for frictionless collaboration.
  • Extending SciDB’s fundamental capabilities generally requires highly skilled software engineers, which tends to raise software quality and traceability.
  • Once the extensions are implemented, they become available to all users, improving reusability.
  • With ODL, both data and processing (i.e., analysis) are concentrated on the same cluster, localizing provenance collection and ensuring better reproducibility.

References

  1. Yu, L., M. L. Rilee, Y. Pan, F. Zhu, K.-S. Kuo, and H. Yu, 2017: Visual analytics with unparalleled variety scaling for big earth data. 2017 IEEE International Conference on Big Data (BIGDATA), Boston, MA, USA, 514–521.