Monday, December 01, 2014

Understanding HBase Table Definition

HBase is a distributed, column-oriented database built on top of HDFS. It is used when you need real-time, random read/write access to very large data sets. An HBase table is like a table in an RDBMS, but cells are versioned, rows are sorted, and columns can be added on the fly as long as their column family already exists. Its API is simpler: basic CRUD operations, plus a scan function to iterate over large key ranges.

It can handle many small files and workloads that need low latency.

A table in HBase is a sparse, distributed, persistent, multidimensional map, indexed by row key, column key, and timestamp. It looks like this:

(Table, RowKey, Column Family, Column, Timestamp) → Value

Put as a data structure, it looks like this:
SortedMap<                                // table
    RowKey,                               // a row
    SortedMap<                            // content of a row is another map
        ColumnFamily,                     // one entry per column family
        SortedMap<                        // within one column family, you can dynamically have multiple columns
            Column,
            SortedMap<Timestamp, Value>   // each column can hold multiple versions of a value, sorted by timestamp descending
        >
    >
>
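To make the nesting concrete, here is a minimal Python sketch of the same idea using plain dictionaries. It is only a mental model, not how HBase stores data, and the helper names `put` and `get_latest` are made up for this illustration.

```python
# Toy model: row key -> column family -> column -> {timestamp: value}.
# Helper names are hypothetical; this is not the HBase client API.

def put(table, row, family, column, timestamp, value):
    """Store one version of a value at (row, family, column, timestamp)."""
    versions = table.setdefault(row, {}).setdefault(family, {}).setdefault(column, {})
    versions[timestamp] = value

def get_latest(table, row, family, column):
    """Return the value with the highest timestamp, or None if the cell is empty."""
    versions = table.get(row, {}).get(family, {}).get(column, {})
    return versions[max(versions)] if versions else None

table = {}
put(table, "row2", "cf1", "key1", 1417141754748, "ama")
put(table, "row1", "cf1", "key1", 1417140625098, "value1")
put(table, "row1", "cf2", "key1", 1417140752958, "value1")

# Rows are kept in key order in HBase; here we sort on read to imitate that.
row_keys = sorted(table)
```

Note that `cf1:key1` exists in both rows while nothing forces other columns to exist anywhere, which is the "sparse" part of the definition.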

I struggled at first to understand its table definition, mainly because I had the wrong impression of what a column is. It is said that columns can be added on the fly, but a column here is really just the key of a key-value pair inside a grid value, if you treat the column family as the counterpart of a column in an RDBMS.

So to summarize, what helps me understand it is to think of it this way:
1. Viewed as a spreadsheet or RDBMS table, the columns are the column families.
2. Within each grid of the table, the data is organized as a map of [key + timestamp] --> value. So far, this maps onto the normal 2D table model.
3. If you want to go further and insist on calling the keys in a grid's data columns, you now have a 3D view of values keyed by (row + column family + timestamp). In this 3D view, each piece of data is called a cell. A cell is tagged/located by (RowKey, Column Family, Column, Timestamp).
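The two views in the list above can be sketched in a few lines of Python. The flat dictionary below plays the role of the cell view, and the hypothetical helper `grid_view` regroups it into the spreadsheet-like view where each (row, column family) grid holds its own small map. The sample cells are taken from the scan output in this post.

```python
# Cell view: every value is located by (row, family, column, timestamp).
cells = {
    ("row1", "cf1", "key1", 1417140625098): "value1",
    ("row1", "cf1", "key2", 1417140642014): "value2",
    ("row1", "cf2", "key1", 1417140752958): "value1",
    ("row2", "cf1", "key1", 1417141754748): "ama",
}

def grid_view(cells):
    """Regroup cells into (row, family) -> {(column, timestamp): value}."""
    grid = {}
    for (row, family, column, timestamp), value in cells.items():
        grid.setdefault((row, family), {})[(column, timestamp)] = value
    return grid

grid = grid_view(cells)
# grid[("row1", "cf1")] is the map stored in the "cf1" grid of row1.
```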

Now look at the table scan result from HBase, which labels its output COLUMN+CELL. Here column = column family:key identifier, and together with the timestamp it tags a value. Here is an example.

ROW                   COLUMN+CELL                                               
 row1                 column=cf1:key1, timestamp=1417140625098, value=value1    
 row1                 column=cf1:key2, timestamp=1417140642014, value=value2    
 row1                 column=cf1:key3, timestamp=1417141283628, value=newValue  
 row1                 column=cf2:key1, timestamp=1417140752958, value=value1    
 row1                 column=cf2:key2, timestamp=1417140761428, value=value2    
 row2                 column=cf1:key1, timestamp=1417141754748, value=ama       
 row2                 column=cf1:key21, timestamp=1417140781886, value=value21  
 row2                 column=cf1:key22, timestamp=1417140892737, value=value22  
 row2                 column=cf2:key1, timestamp=1417140909231, value=value1  
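As a rough illustration of how each output line maps onto the (row, column family, column, timestamp) coordinates, the Python sketch below parses lines in the format shown above. The regular expression is written against this sample only; it is an assumption that other shell versions or values containing unusual characters would need adjustments.

```python
import re

# Pattern based on the sample scan output in this post (an assumption, not a spec).
SCAN_LINE = re.compile(
    r"^\s*(?P<row>\S+)\s+column=(?P<family>[^:]+):(?P<column>[^,]*),"
    r"\s*timestamp=(?P<timestamp>\d+),\s*value=(?P<value>.*?)\s*$"
)

def parse_scan_line(line):
    """Return (row, family, column, timestamp, value), or None for non-data lines."""
    m = SCAN_LINE.match(line)
    if m is None:
        return None
    return (m["row"], m["family"], m["column"], int(m["timestamp"]), m["value"])

header = parse_scan_line("ROW                   COLUMN+CELL")
cell = parse_scan_line(
    " row1                 column=cf1:key1, timestamp=1417140625098, value=value1    "
)
```

The header line does not match the pattern, so it comes back as `None`, while the data line splits cleanly into the five parts.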

Now check the put statement:
put "test","row1","cf1:key3","value4"

It inserts a value into the cell tagged by row1 + column family + column. The cell can hold multiple versions of the data, and a simple scan statement displays only the most recent one.
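Here is a small Python sketch of that versioning behavior for a single cell. The names `put_version` and `visible_value` are hypothetical, and `max_versions` is a parameter of this sketch; in HBase the number of retained versions is configured per column family. The second timestamp is also made up for the example.

```python
import time

def put_version(cell_versions, value, timestamp=None, max_versions=3):
    """Add a version to one cell and keep only the newest max_versions entries."""
    if timestamp is None:
        # Default to the current time in milliseconds, like the timestamps above.
        timestamp = int(time.time() * 1000)
    cell_versions[timestamp] = value
    for old in sorted(cell_versions, reverse=True)[max_versions:]:
        del cell_versions[old]

def visible_value(cell_versions):
    """Return what a simple scan would show: the value with the newest timestamp."""
    return cell_versions[max(cell_versions)] if cell_versions else None

cell = {}  # stands for the cell at row1, cf1:key3
put_version(cell, "newValue", timestamp=1417141283628)  # value seen in the scan above
put_version(cell, "value4", timestamp=1417141300000)    # hypothetical later put
shown = visible_value(cell)
```

After the second put, the older `newValue` is still stored under its own timestamp; it is just no longer the version that a plain scan displays.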

I drew a diagram to help myself understand it.
