QbeastTable

This API is implemented to interact with your QbeastTable for exploring its basic information and optimization.

Creating an instance of QbeastTable is as easy as:

import io.qbeast.table.QbeastTable

val qbeastTable = QbeastTable.forPath(spark, "path/to/qbeast/table")

Table Information

The data in a qbeast table is organized in a sequence of Revisions which are defined by the indexing columns (current immutable), their transformers/transformations, and cube sizes. Each time you change any of these parameters when adding new data to the table, a corresponding new Revision is created. When using LinearTransformations, a new Revision is created when the current column min/max are exceeded.

Each revision represents a new index of the table, and the data is organized in a tree structure.

When appending data or optimizing the unindexed revision (RevisionID = 0), the data is added to the latest revision.

Use the following APIs to list the current table revisions.

qbeastTable.allRevisions() // all the current Revisions

qbeastTable.allRevisionIDs() // all the current Revision identifiers

The following methods provide information about the latest Revision of the table.

qbeastTable.indexedColumns() // the indexed columns and their transformers of the latest revision
qbeastTable.indexedColumns(revisionID) // indexed columns and their transformers for a specific revision

qbeastTable.cubeSize() // the cube size of the latest revision
qbeastTable.cubeSize(revisionID) // the cube size of a specific revision

Index Metrics

IndexMetrics provides an overview of a given revision of the index.

You can use it during development to compare indexes built using different indexing parameters such as the desiredCubeSize, columnsToIndex, as well as their transformers.

val metrics = qbeastTable.getIndexMetrics()

println(metrics)

// EXAMPLE OUTPUT
OTree Index Metrics:
revisionId: 1
elementCount: 309008
dimensionCount: 2
desiredCubeSize: 3000
indexingColumns: price:linear,user_id:linear
height: 8 (4)
avgFanout: 3.94 (4.0)
cubeCount: 230
blockCount: 848
fileCount: 61
bytes: 11605372

Multi-block files stats:
cubeElementCountStats: (count: 230, avg: 1343, std: 1186, quartiles: (1,292,858,2966,3215))
blockElementCountStats: (count: 848, avg: 364, std: 829, quartiles: (1,6,21,42,3120))
fileBytesStats: (count: 61, avg: 190252, std: 57583, quartiles: (113168,139261,182180,215136,332851))
blockCountPerCubeStats: (count: 230, avg: 3, std: 1, quartiles: (1,4,4,4,4))
blockCountPerFileStats: (count: 61, avg: 13, std: 43, quartiles: (1,3,5,5,207))

Inner cubes depth-wise stats:
cubeElementCountStats: (count: 58, avg: 3069, std: 58, quartiles: (2942,3026,3070,3109,3215))
blockElementCountStats: (count: 232, avg: 767, std: 1278, quartiles: (14,26,33,2859,3120))
depth avgCubeElementCount cubeCount blockCount cubeElementCountStd cubeElementCountQuartiles  avgWeight           
0     3026                1         4          0                   (3026,3026,3026,3026,3026) 0.009708486732493268
1     3120                2         8          72                  (3048,3048,3193,3193,3193) 0.1777036037942636  
2     3088                3         12         91                  (3003,3003,3048,3215,3215) 0.29339823296605566 
3     3078                7         28         45                  (3016,3046,3070,3110,3168) 0.3028351948969015  
4     3075                12        48         80                  (2942,3027,3084,3148,3214) 0.3853194593029685  
5     3056                21        84         45                  (2975,3016,3066,3090,3127) 0.557916611995329   
6     3071                12        48         37                  (2999,3055,3084,3111,3115) 0.71568557500894    

Leaf cubes depth-wise stats:
cubeElementCountStats: (count: 172, avg: 761, std: 733, quartiles: (1,163,584,1159,2966))
blockElementCountStats: (count: 616, avg: 212, std: 498, quartiles: (1,4,10,36,2877))
depth avgCubeElementCount cubeCount blockCount cubeElementCountStd cubeElementCountQuartiles avgWeight         
1     47                  2         4          46                  (1,1,94,94,94)            1515.9574468085107
2     358                 4         12         389                 (4,36,424,970,970)        210.8753971341503 
3     1010                5         20         994                 (314,330,677,767,2966)    5.5998339972396405
4     836                 14        45         951                 (4,38,315,1612,2552)      106.06026824809581
5     978                 27        103        827                 (63,299,616,1681,2666)    8.58027316980015  
6     811                 72        267        686                 (7,280,626,1253,2889)     21.17595889195726 
7     580                 48        165        590                 (3,143,515,779,2246)      76.10800345961682  

Metrics

1. General index metadata:

revisionId: the identifier for the index revision
elementCount: the number of records for this revision
dimensionCount: the number of dimensions in the index
desiredCubeSize: the target number of elements per cube
indexingColumns: the columns that are indexed. The format is column1:type,column2:type,... where type is the transformation type applied to the column
height: the height of the tree. In parentheses, we have the theoretical height if the index is perfectly balanced
avgFanOut: the average number of children per non-leaf cube. In parentheses, we have the max value for this metric, which is equal to 2 ^ dimensionCount
cubeCount: the total number of nodes in the index tree
blockCount: the number of blocks in the index tree
fileCount: the number of files in the index tree
bytes: the total size of the index tree in bytes

2. Multi-block files stats::

The following metrics describe the distribution of the cubes, blocks, and files in the index tree:

cubeElementCountStats: the statistics of the number of elements per cube
blockElementCountStats: the statistics of the number of elements per block
fileBytesStats: the statistics of the size of the files
blockCountPerCubeStats: the statistics of the number of blocks per cube
blockCountPerFileStats: the statistics of the number of blocks per file

For example, cubeElementCountStats: (count: 230, avg: 1343, std: 1186, quartiles: (1,292,858,2966,3215)) means that

there are230 cubes
the average number of elements per cube is 1343
the standard deviation is 1186
the quartiles are (min:1, 1stQ:292, 2ndQ:858, 3rdQ:2966, max:3215)

3. Depth-wise cube statistics:

The following metrics describe the distribution of the cubes in the index tree based on their depth, specifically for the inner cubes:

Inner cubes depth-wise stats:
cubeElementCountStats: (count: 58, avg: 3069, std: 58, quartiles: (2942,3026,3070,3109,3215))
blockElementCountStats: (count: 232, avg: 767, std: 1278, quartiles: (14,26,33,2859,3120))
depth avgCubeElementCount cubeCount blockCount cubeElementCountStd cubeElementCountQuartiles  avgWeight           
0     3026                1         4          0                   (3026,3026,3026,3026,3026) 0.009708486732493268
1     3120                2         8          72                  (3048,3048,3193,3193,3193) 0.1777036037942636  
2     3088                3         12         91                  (3003,3003,3048,3215,3215) 0.29339823296605566 
3     3078                7         28         45                  (3016,3046,3070,3110,3168) 0.3028351948969015  
4     3075                12        48         80                  (2942,3027,3084,3148,3214) 0.3853194593029685  
5     3056                21        84         45                  (2975,3016,3066,3090,3127) 0.557916611995329   
6     3071                12        48         37                  (2999,3055,3084,3111,3115) 0.71568557500894    

cubeElementCountStats and blockElementCountStats are the same as in the previous section. The same metrics are displayed for each level of the index tree.

Table Operations

Optimization

Through QbeastTable you can execute the Optimize operation to rearrange the files in the table according to the dictates of the index to render queries more efficient.

As more and more data are added to the table, the data layout starts to diverge from the optimal layout. The optimize operation can be used to reorganize the data to improve the query performance.

Parameters of Manual Optimization

Parameter	Description	Default
`revisionID`	The revision identifier to optimize.	Latest revision
`filePaths`	The list of table or revision files to optimize.	Empty Sequence
`fraction`	The fraction of the data of the table or revision to optimize.	1.0
`options`	A map of options for optimization. You can use it to configure `userMetadata` and `PreCommitHook`.	Empty map

Depending on the parameters you provide, the optimization process will be different.

Data can be added to the unindexed revision using the underlying table format (e.g. delta) without using qbeast. Whenever the data from the unindexed revision (RevisionID=0) is optimized, they will be added to the latest revision. If the transformation boundaries are exceeded, a new Revision will be created.

FULL_TABLE: Triggered when no parameters are provided, the optimization will be applied to ALL revisions.
TABLE_FILES: If only filePaths is provided, the optimization will be applied to these specific files. The files can be from different revisions, and the data will be re-indexed within their own revision (except the unindexed revision.)
TABLE_FRACTION: If only fraction is provided, each revision will re-index their own files to improve the sampling performance for all fraction <= fraction. For the indexed revisions, this can trigger an optimization size larger than fraction * revisionSize. A size proportional to fraction * revisionSize will be optimized for the unindexed revision.
FULL_REVISION: If only revisionID is provided, the entire target revision will be re-indexed.
REVISION_FRACTION: If revisionID and fraction are provided, the optimization will be applied to the specified fraction of the data from the specified revision. Its usage is recommended when the layout of a specific revision is known to be problematic.
REVISION_FILES: If revisionID and filePaths are provided, the optimization will be applied to the specified files from the specified revision.

Examples

// Optimize all revisions from the table. 
qbeastTable.optimize() // FULL_TABLE optimization

// Optimizing revision 2 for sampling fraction <= 10%. User metadata are stored in the commit info.
qbeastTable.optimize(2L, 0.1, Map["userMetadata" -> "user-metadata-for-optimization"])

// Optimizing all data from a given Revision
qbeastTable.optimize(2L)

// Optimizing a set of files from revision 1. Only the matched files will be optimized.
qbeastTable.optimize(1L, Seq("file1", "file2"))

// Optimizes a sequence of indexed files. The files can be from different revisions.
qbeastTable.optimize(Seq("file1", "file2"))

// Optimizing a sequence of unindexed files
qbeastTable.optimize(0L, Seq("file1", "file2"))

// Optimizing a fraction of unindexed data
qbeastTable.optimize(0L, fraction = 0.5)

Table Information​

Index Metrics​

Metrics​

1. General index metadata:​

2. Multi-block files stats::​

3. Depth-wise cube statistics:​

Table Operations​

Optimization​

Parameters of Manual Optimization​

Examples​