Twenty Percent Knowledge: December 2014

HBase: a distributed, column-oriented database built on top of HDFS. It is used when you require real-time, random read/write access to very large data sets. Its tables resemble RDBMS tables, but cells are versioned, rows are sorted by row key, and columns can be added on the fly as long as their column family already exists. It has a simple API for basic CRUD operations, plus a scan function to iterate over large key ranges.

It can deal with many small files and with low-latency access patterns.
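The sorted rows and the key-range scan mentioned above can be pictured with a small Python sketch. This is a toy in-memory model, not the HBase client API: the `scan` helper and the sample row keys are made up for illustration, and the range is treated as start-inclusive, stop-exclusive.

```python
from bisect import bisect_left

# Toy model: row keys kept in sorted order, as HBase keeps rows sorted by key.
rows = {
    "row1": "a",
    "row2": "b",
    "row3": "c",
    "user10": "d",
    "user2": "e",
}
sorted_keys = sorted(rows)  # lexicographic order, so "user10" sorts before "user2"


def scan(start, stop):
    """Return (key, value) pairs with start <= key < stop (hypothetical helper)."""
    lo = bisect_left(sorted_keys, start)
    hi = bisect_left(sorted_keys, stop)
    return [(k, rows[k]) for k in sorted_keys[lo:hi]]


print(scan("row2", "row9"))   # rows from row2 up to, but not including, row9
print(scan("user", "user~"))  # every key starting with "user"
```

Because the keys are sorted as strings, a range scan only has to find two positions and read everything in between, which is why row key design matters so much.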

A table in HBase is a sparse, distributed, persistent, multidimensional map, indexed by row key, column key, and timestamp. It looks like this:

(Table, RowKey, column Family, Column, Timestamp) → Value

Expressed as a data structure, it looks like this:

SortedMap<                  // table
    RowKey,                 // a row
    List<                   // content of a row: a list of maps
        SortedMap<          // one map per column family
            Column,         // within one column family, you can dynamically have multiple columns
            List<           // each column can hold multiple versions of a value, sorted by timestamp, descending
                Pair<Value, Timestamp>
            >
        >
    >
>
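As a rough sketch of that nesting, the same shape can be written with plain Python dictionaries. Most of the sample values follow the scan output later in this post; the older `oldValue` version and its timestamp are made up for illustration, and the `latest` and `at` helpers are hypothetical names, not anything HBase provides.

```python
# table -> row key -> column family -> column -> list of (timestamp, value)
# Version lists are kept newest first. Entries marked "made up" are
# illustrative and not taken from a real table.
table = {
    "row1": {
        "cf1": {
            "key1": [(1417140625098, "value1")],
            "key3": [
                (1417141283628, "newValue"),
                (1417141000000, "oldValue"),  # made-up older version
            ],
        },
        "cf2": {
            "key1": [(1417140752958, "value1")],
        },
    },
}


def latest(row, family, column):
    """Return the newest value stored in one cell (hypothetical helper)."""
    timestamp, value = table[row][family][column][0]
    return value


def at(row, family, column, timestamp):
    """Return the value of one exact version, or None if it is absent."""
    return dict(table[row][family][column]).get(timestamp)


print(latest("row1", "cf1", "key3"))
print(at("row1", "cf1", "key3", 1417141000000))
```

Note that `row1` has no `cf2:key3` entry at all: nothing is stored for it, which is what "sparse" means here.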

At first I struggled to understand its table definition, mainly because I had the wrong impression of what a column is. It is said that columns can be added on the fly, but a column here is really just the key of a key-value pair stored inside a grid cell, if you treat the column family as the equivalent of a column in an RDBMS.

So to summarize, what helps me understand it is to think of it this way:

1. Viewed as a spreadsheet or an RDBMS table, the columns are the column families.

2. Within each grid cell of the table, the data is organized as a map of [key + timestamp] --> value. So far, this corresponds to the normal 2D table model.

3. If you go further and insist on calling the keys inside a grid cell's data columns, you now have a 3D view in which values are keyed by (row + column family + timestamp). In this 3D view, each piece of data is called a cell, and a cell is located by (RowKey, Column Family, Column, Timestamp).
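The step from the grid view to the cell view in the list above can be sketched as a flattening. This is only my mental model in Python, not how HBase stores anything internally; the `grid` layout and the `cells` helper are hypothetical, and the sample values follow the scan output style.

```python
# 2D view: one grid entry per (row, column family); each entry is a small map
# from column to {timestamp: value}.
grid = {
    ("row1", "cf1"): {"key1": {1417140625098: "value1"},
                      "key2": {1417140642014: "value2"}},
    ("row1", "cf2"): {"key1": {1417140752958: "value1"}},
}


def cells(grid):
    """Yield ((row, family, column, timestamp), value) for every stored value."""
    for (row, family), columns in sorted(grid.items()):
        for column, versions in sorted(columns.items()):
            # Newest version first, mirroring the descending timestamp order.
            for timestamp in sorted(versions, reverse=True):
                yield (row, family, column, timestamp), versions[timestamp]


flat = dict(cells(grid))
for coordinates, value in flat.items():
    print(coordinates, "->", value)
```

Each printed line is one cell: the four coordinates on the left, the value on the right.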

Looking at the result of a table scan in HBase, each entry is labeled COLUMN+CELL: column = column family:key identifier, which together with the timestamp tags a value. Here is an example.

ROW COLUMN+CELL

row1 column=cf1:key1, timestamp=1417140625098, value=value1

row1 column=cf1:key2, timestamp=1417140642014, value=value2

row1 column=cf1:key3, timestamp=1417141283628, value=newValue

row1 column=cf2:key1, timestamp=1417140752958, value=value1

row1 column=cf2:key2, timestamp=1417140761428, value=value2

row2 column=cf1:key1, timestamp=1417141754748, value=ama

row2 column=cf1:key21, timestamp=1417140781886, value=value21

row2 column=cf1:key22, timestamp=1417140892737, value=value22

row2 column=cf2:key1, timestamp=1417140909231, value=value1
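To convince myself that every scan line really is one (row, column family, column, timestamp) coordinate plus a value, the lines can be pulled apart with a regular expression. This is a sketch that assumes the single-space layout shown above and values without commas or line breaks; real shell output is padded into columns and may wrap, so the pattern is an assumption, and `parse` is a hypothetical helper.

```python
import re

# A few lines in the format shown above.
scan_output = """\
row1 column=cf1:key1, timestamp=1417140625098, value=value1
row1 column=cf2:key1, timestamp=1417140752958, value=value1
row2 column=cf1:key21, timestamp=1417140781886, value=value21
"""

LINE = re.compile(
    r"^(?P<row>\S+)\s+column=(?P<family>[^:]+):(?P<column>[^,]*),"
    r"\s+timestamp=(?P<timestamp>\d+),\s+value=(?P<value>.*)$"
)


def parse(text):
    """Turn scan lines into {(row, family, column, timestamp): value}."""
    result = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            key = (match["row"], match["family"], match["column"],
                   int(match["timestamp"]))
            result[key] = match["value"]
    return result


parsed = parse(scan_output)
print(parsed[("row2", "cf1", "key21", 1417140781886)])
```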

Now check the put statement:

put "test","row1","cf1:key3","value4"

This inserts a value into the cell located by row1 + column family + column. The cell can hold multiple versions of the data, and a simple scan displays only the most recent one.
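The put-then-scan behaviour can be mimicked with a toy model in Python. It is not the HBase API: the counter stands in for the timestamp HBase assigns, `put` and `scan` are hypothetical helpers that only borrow the shell command names, and the sketch ignores the per-column-family limit on how many versions are retained.

```python
import itertools

_clock = itertools.count(1)  # stand-in for the server-assigned timestamp
table = {}  # (row, "family:column") -> list of (timestamp, value), newest first


def put(row, column, value):
    """Add a new version to one cell instead of overwriting it (toy model)."""
    versions = table.setdefault((row, column), [])
    versions.insert(0, (next(_clock), value))


def scan():
    """Return only the newest version of every cell, like a plain scan."""
    return {cell: versions[0][1] for cell, versions in sorted(table.items())}


put("row1", "cf1:key3", "value3")
put("row1", "cf1:key3", "value4")
print(scan())                         # newest value per cell
print(table[("row1", "cf1:key3")])    # both versions are still stored
```

The second `put` does not replace the first value; it only pushes a newer version on top of it.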

I drew a diagram to help myself understand it.

Monday, December 01, 2014

Understanding HBase Table Definition