HBase can deal with many small files and low-latency situations.
A table in HBase is a sparse, distributed, persistent, multidimensional map, indexed by row key, column key, and timestamp. It looks like this:
(Table, RowKey, Column Family, Column, Timestamp) → Value
Expressed as a data structure, it looks like this:
SortedMap<                  // table
    RowKey,                 // a row
    List<                   // content of a row: a list of maps
        SortedMap<          // one map is one column family
            Column, List<   // within one column family, you can dynamically have multiple columns
                Value, Timestamp   // each column can hold multiple versions of a value, sorted by timestamp, newest first
            >
        >
    >
>
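To make the nesting concrete, here is a minimal Python sketch of the same shape using plain dictionaries. It is only a toy model, not the HBase API: the helper name make_table is made up for illustration, real HBase stores keys and values as bytes, and a dict has to be sorted on read where a SortedMap stays sorted by itself.

```python
from collections import defaultdict

# Toy in-memory model of the nested map; this is NOT the HBase API.
# Shape: table[row_key][column_family][column][timestamp] = value
def make_table():
    return defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))

table = make_table()
table["row1"]["cf1"]["key1"][1417140625098] = "value1"
table["row1"]["cf1"]["key2"][1417140642014] = "value2"
table["row1"]["cf2"]["key1"][1417140752958] = "value1"

# Walk the levels in sorted order, mimicking what a SortedMap gives for free.
for row_key in sorted(table):
    for family in sorted(table[row_key]):
        for column in sorted(table[row_key][family]):
            versions = table[row_key][family][column]
            print(row_key, family, column, versions)
```

Each level of the loop corresponds to one level of the nested type above: row, column family, column, then the versions of the value.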
I struggled at first to understand its table definition, mainly because I had the wrong impression of what a column is. It's said that columns can be added on the fly, but a column here is really just the key of a key-value pair inside a grid value, if you treat the column family as the column in the RDBMS sense.
So to summarize, what helps me understand it is to think of it this way:
1. Viewed as a spreadsheet or an RDBMS table, the columns are the column families.
2. Within each grid in the table, the data is organized as a map of [key + timestamp] --> value. So far, this maps onto the normal 2D table model.
3. If you want to go further and insist on calling the keys in a grid's data columns, you now have a 3D view of values keyed by (row + column family + timestamp). In this 3D view, each piece of data is called a cell. A cell is tagged/located by (RowKey, Column Family, Column, Timestamp).
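The cell view in point 3 can be sketched by flattening the nested structure into a list of cells, each carrying its full coordinates. This is a rough illustration only: the Cell type and the to_cells helper are names I made up, and the sample values are taken from the scan output in this post.

```python
from typing import NamedTuple

class Cell(NamedTuple):
    # The four coordinates that locate a value, plus the value itself.
    row: str
    family: str
    column: str
    timestamp: int
    value: str

def to_cells(table):
    """Flatten table[row][family][column][timestamp] = value into sorted cells."""
    cells = []
    for row, families in table.items():
        for family, columns in families.items():
            for column, versions in columns.items():
                for timestamp, value in versions.items():
                    cells.append(Cell(row, family, column, timestamp, value))
    # Order by coordinates, with the newest timestamp first within a column.
    return sorted(cells, key=lambda c: (c.row, c.family, c.column, -c.timestamp))

table = {
    "row1": {
        "cf1": {"key1": {1417140625098: "value1"}},
        "cf2": {"key1": {1417140752958: "value1"}},
    },
    "row2": {"cf1": {"key1": {1417141754748: "ama"}}},
}
cells = to_cells(table)
for cell in cells:
    print(cell)
```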
Now look at the table scan result from HBase, which labels its output COLUMN+CELL. A column is written as column family:key identifier, and together with the timestamp it tags a value. Here is an example:
ROW COLUMN+CELL
row1 column=cf1:key1, timestamp=1417140625098, value=value1
row1 column=cf1:key2, timestamp=1417140642014, value=value2
row1 column=cf1:key3, timestamp=1417141283628, value=newValue
row1 column=cf2:key1, timestamp=1417140752958, value=value1
row1 column=cf2:key2, timestamp=1417140761428, value=value2
row2 column=cf1:key1, timestamp=1417141754748, value=ama
row2 column=cf1:key21, timestamp=1417140781886, value=value21
row2 column=cf1:key22, timestamp=1417140892737, value=value22
row2 column=cf2:key1, timestamp=1417140909231, value=value1
Now check the put statement:
put "test","row1","cf1:key3","value4"
It inserts a value into the cell tagged by row1 + column family + column. The cell can hold multiple versions of the data, and a simple scan displays only the most recent one.
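The versioning behaviour can be imitated with a small Python sketch. It is a toy model, not how HBase is implemented: the put and scan functions below are my own stand-ins for the shell commands, the second timestamp is invented for the example, and real HBase limits how many versions it keeps through a per-column-family setting.

```python
import time

# Toy store: table[row][(family, qualifier)][timestamp] = value
table = {}

def put(row, column, value, timestamp=None):
    """Add a new version to the cell at row + 'family:qualifier'."""
    family, qualifier = column.split(":", 1)
    if timestamp is None:
        timestamp = int(time.time() * 1000)  # milliseconds, like the scan output
    table.setdefault(row, {}).setdefault((family, qualifier), {})[timestamp] = value

def scan():
    """Return one entry per column, showing only the newest version."""
    lines = []
    for row in sorted(table):
        for family, qualifier in sorted(table[row]):
            versions = table[row][(family, qualifier)]
            newest = max(versions)  # the largest timestamp wins
            lines.append((row, f"{family}:{qualifier}", newest, versions[newest]))
    return lines

put("row1", "cf1:key3", "newValue", timestamp=1417141283628)
put("row1", "cf1:key3", "value4", timestamp=1417141283629)  # hypothetical later write
result = scan()
print(result)
```

Both versions remain stored in the cell, but the scan reports only the one with the latest timestamp.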
I drew a diagram to help myself understand it.