A few days ago, I watched the TED talk Your body language shapes who you are (http://www.ted.com/playlists/171/the_most_popular_talks_of_all). Suddenly one of the major reasons why I failed my Amazon onsite became clear to me: I struck too many dominant postures instead of showing how modest I am. Imagine it: would you hire a candidate who was sitting on the table while you kept stretching or opening your arms to try to dominate back?
Jesus, I wish I had watched it before my interview.
Quick tips or notes that probably reflect the 20 percent of knowledge that usually does 80 percent of the job.
Tuesday, December 02, 2014
Monday, December 01, 2014
Understanding HBase Table Definition
HBase: a distributed, column-oriented database built on top of HDFS. It's used when you require real-time (read/write) random access to very large data sets. Its tables are like tables in an RDBMS, but cells are versioned, rows are sorted, and columns can be added on the fly as long as the column family already exists. It has a simple API for basic CRUD operations, plus a scan function to iterate over large key ranges.
It can deal with many small files and low-latency situations.
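Because rows are kept sorted by key, a scan over a key range only needs to find the start key and then walk forward. Here is a minimal Python sketch of that idea. The row keys and values are made up, `scan` is a hypothetical helper rather than the HBase client API, and the inclusive-start, exclusive-stop range is how I understand HBase scans to behave.

```python
from bisect import bisect_left

# Hypothetical toy data; the row keys are kept in sorted order.
rows = {
    "row1": "a",
    "row2": "b",
    "row3": "c",
    "user9": "d",
}
sorted_keys = sorted(rows)


def scan(start_row, stop_row):
    """Yield (key, value) for start_row <= key < stop_row, in key order."""
    i = bisect_left(sorted_keys, start_row)  # binary search for the start key
    while i < len(sorted_keys) and sorted_keys[i] < stop_row:
        key = sorted_keys[i]
        yield key, rows[key]
        i += 1


result = list(scan("row2", "row4"))
```

The scan never touches keys outside the requested range, which is why choosing row keys well matters so much.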
A table in HBase is a sparse, distributed, persistent, multidimensional map, which is indexed by row key, column key, and timestamp. It looks like this:
(Table, RowKey, column Family, Column, Timestamp) → Value
Put as a data structure, it looks like this:
SortedMap<                  // table
    RowKey,                 // a row
    List<                   // content of a row: one map per column family
        SortedMap<          // one map is one column family
            Column,         // within one column family, you can dynamically have multiple columns
            List<           // each column can hold multiple versions of a value
                Value, Timestamp    // sorted by timestamp, newest first
            >
        >
    >
>
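To make the nesting concrete, here is a small in-memory Python sketch of the same shape. `ToyTable` and its methods are hypothetical names for illustration only, not HBase's API or storage format. Plain dicts stand in for the sorted maps, and timestamps are passed in by hand.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for the nested map; not HBase's real storage."""

    def __init__(self, families):
        self.families = set(families)  # fixed when the table is created
        # row key -> column family -> column -> {timestamp: value}
        self.rows = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))

    def put(self, row, family, column, value, timestamp):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # Any column name is accepted as long as its family exists.
        self.rows[row][family][column][timestamp] = value

    def get(self, row, family, column):
        """Return the value with the newest timestamp, or None."""
        versions = self.rows.get(row, {}).get(family, {}).get(column, {})
        if not versions:
            return None
        return versions[max(versions)]


table = ToyTable(["cf1", "cf2"])
table.put("row1", "cf1", "key1", "value1", 1)
table.put("row1", "cf1", "brand_new_column", "x", 2)  # column added on the fly
latest = table.get("row1", "cf1", "brand_new_column")

try:
    table.put("row1", "cf3", "key1", "v", 3)  # cf3 was never declared
    rejected = False
except KeyError:
    rejected = True
```

The point of the sketch is the asymmetry: column families are declared up front, while columns are just keys inside a family's map.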
At first I struggled to understand its table definition, mainly because I had the wrong impression of what a column is. It's said that columns can be added on the fly, but a column here is really just the key of a key-value pair stored inside a grid cell, if you treat the column family as the equivalent of a column in an RDBMS.
So to summarize, what helps me understand it is to think of it this way:
1. Viewed as a spreadsheet or RDBMS table, the columns are column families.
2. Within each grid cell of that table, the data is organized as a map of [key+timestamp] --> value. So far, this still maps onto the normal 2D table model.
3. If you go further and insist on calling the keys inside a grid cell's data columns, you now have a 3D view of values keyed by (row + column family + timestamp). In this 3D view, each piece of data is called a cell. A cell is tagged/located by (RowKey, Column Family, Column, Timestamp).
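The two views in the list above can be sketched in Python as the same data grouped two ways. The sample values come from the scan output in this post. `to_grid` is a hypothetical helper written only for this illustration.

```python
# Flat view: every cell is located by (row, family, column, timestamp).
cells = {
    ("row1", "cf1", "key1", 1417140625098): "value1",
    ("row1", "cf1", "key2", 1417140642014): "value2",
    ("row1", "cf2", "key1", 1417140752958): "value1",
    ("row2", "cf1", "key1", 1417141754748): "ama",
}


def to_grid(cells):
    """Regroup flat cells into the 2D view: (row, family) -> {(column, ts): value}."""
    grid = {}
    for (row, family, column, ts), value in cells.items():
        grid.setdefault((row, family), {})[(column, ts)] = value
    return grid


grid = to_grid(cells)
# grid[("row1", "cf1")] is the small map that lives in one spreadsheet-style square.
```

Nothing is lost or gained in the conversion. It is only a question of whether the column counts as part of the address or part of the content.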
Now look at the table scan result from HBase, which labels its output COLUMN+CELL. Here column = column family:key identifier, and together with the timestamp it tags a value. Here is an example.
ROW     COLUMN+CELL
row1    column=cf1:key1, timestamp=1417140625098, value=value1
row1    column=cf1:key2, timestamp=1417140642014, value=value2
row1    column=cf1:key3, timestamp=1417141283628, value=newValue
row1    column=cf2:key1, timestamp=1417140752958, value=value1
row1    column=cf2:key2, timestamp=1417140761428, value=value2
row2    column=cf1:key1, timestamp=1417141754748, value=ama
row2    column=cf1:key21, timestamp=1417140781886, value=value21
row2    column=cf1:key22, timestamp=1417140892737, value=value22
row2    column=cf2:key1, timestamp=1417140909231, value=value1
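As a rough illustration of how such a listing falls out of the nested map, the Python sketch below walks rows and columns in sorted order and prints only the newest version of each cell. `format_scan` is a hypothetical helper, not part of HBase. The older version of cf1:key3 (timestamp 1417140700000, value "value3") is invented for the example. The rest is taken from the output above.

```python
def format_scan(table):
    """table: row -> "family:qualifier" -> {timestamp: value}."""
    lines = ["ROW COLUMN+CELL"]
    for row in sorted(table):
        for column in sorted(table[row]):
            versions = table[row][column]
            ts = max(versions)  # only the newest version is shown
            lines.append(
                f"{row} column={column}, timestamp={ts}, value={versions[ts]}"
            )
    return lines


table = {
    "row2": {
        "cf1:key21": {1417140781886: "value21"},
        "cf1:key1": {1417141754748: "ama"},
    },
    "row1": {
        # The older entry here is made up to show that it is hidden.
        "cf1:key3": {1417140700000: "value3", 1417141283628: "newValue"},
    },
}

lines = format_scan(table)
for line in lines:
    print(line)
```

Even though row2 was inserted first and key21 before key1, the output comes back ordered by row key and then by column, the same ordering seen in the shell.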
Now check the put statement:
put "test","row1","cf1:key3","value4"
It inserts a value into the cell tagged by row1 + column family + column. A cell can hold multiple versions of data, and a simple scan displays only the topmost (most recent) value.
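The versioning behaviour can be sketched in Python for a single cell. This is only a model under assumptions: the logical clock replaces real timestamps, the earlier values are made up, and the limit of 3 retained versions is picked for the example (in HBase, as far as I know, the number of versions kept is a setting on the column family).

```python
import itertools

_clock = itertools.count(1)  # hypothetical logical clock instead of wall time


def put(cell_versions, value, max_versions=3):
    """Add a new version to one cell and drop the oldest beyond max_versions."""
    cell_versions.append((next(_clock), value))
    cell_versions.sort(reverse=True)  # newest first
    del cell_versions[max_versions:]


def latest(cell_versions):
    """Return the newest value, which is what a simple scan would show."""
    return cell_versions[0][1] if cell_versions else None


cell = []  # versions of the cell row1 / cf1:key3
for v in ["value1", "value2", "value3", "value4"]:
    put(cell, v)

newest = latest(cell)
```

After the four puts, the cell still holds three versions, and the oldest one has been dropped without any explicit delete.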
I drew a diagram to help myself understand it.