Twenty Percent Knowledge: Some Better Answers

Talking about a project

What people are mostly interested in is its impact.

Say briefly what it is, then emphasize its impact: the good result.

For example, the big data concepts and process I delivered not only made the IT department aware of this technology, but also gave it confidence when other departments later requested IT's support for big data related course designs.

Do you know what a slowly changing dimension is?

A dimension that changes over time, but slowly. For example, a product's promotion status or a person's gender.

There are several ways to handle the situation:
Type 0 ignores the change and keeps the original value.
Type 1 overwrites the old value with the new value. This solution loses history, so you will have trouble when examining data from the past.
Type 2 creates a new dimension record for each change, keeping the date range in which each value is in effect.
Type 3 adds a new column to hold the previous value. I prefer not to use it because it cannot capture all the changes.
Type 4 keeps a separate history table and defines the date range in which each record is in effect.
Type 6 is 1+2+3 combined. In my view this is no better than type 4.
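
The type 2 approach is the one that usually benefits most from a concrete picture. The following is only a minimal in-memory sketch in Python, not a warehouse implementation: the row layout (`key`, `value`, `valid_from`, `valid_to`), the `OPEN_END` sentinel date and both helper functions are assumptions made up for illustration.

```python
from datetime import date

# Hypothetical sentinel meaning "still in effect".
OPEN_END = date(9999, 12, 31)


def apply_type2_change(rows, key, new_value, change_date):
    """Return a new row list where the current row for `key` is closed
    and a new row carrying `new_value` is opened on `change_date`."""
    updated = []
    for row in rows:
        if row["key"] == key and row["valid_to"] == OPEN_END:
            if row["value"] == new_value:
                return list(rows)  # value did not change, keep history as is
            row = dict(row, valid_to=change_date)  # close the current version
        updated.append(row)
    updated.append(
        {"key": key, "value": new_value,
         "valid_from": change_date, "valid_to": OPEN_END}
    )
    return updated


def value_as_of(rows, key, as_of):
    """Look up the value in effect on `as_of` (valid_from inclusive, valid_to exclusive)."""
    for row in rows:
        if row["key"] == key and row["valid_from"] <= as_of < row["valid_to"]:
            return row["value"]
    return None


history = [{"key": "P1", "value": "regular",
            "valid_from": date(2015, 1, 1), "valid_to": OPEN_END}]
history = apply_type2_change(history, "P1", "on promotion", date(2015, 5, 1))
```

Because the old row is closed rather than overwritten, `value_as_of` can still answer questions about the past, which is exactly what type 1 gives up.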

Describe how MapReduce works.

In programming, you typically provide two functions. One is called the mapper, the other is called the reducer.
The mapper takes a list of key-value pairs and transforms them into another list of key-value pairs, possibly of a different type. The MapReduce framework then shuffles the mapper output, grouping it by key. The reducer takes the shuffled result and generates another list of key-value pairs that is some kind of aggregation.
Put simply, the mapper focuses on transformation and the reducer focuses on aggregation.
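
The map, shuffle and reduce steps can be sketched in a single process with the classic word count. This is a toy illustration in Python, not the Hadoop API: `run_mapreduce`, `mapper` and `reducer` are hypothetical names, and a real framework would run the phases across many machines.

```python
from collections import defaultdict


def mapper(_key, line):
    """Transform one (line number, line) pair into (word, 1) pairs."""
    for word in line.split():
        yield word.lower(), 1


def reducer(word, counts):
    """Aggregate all the counts collected for one word."""
    yield word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    # Map phase: every input pair may produce any number of output pairs.
    mapped = []
    for key, value in records:
        mapped.extend(mapper(key, value))

    # Shuffle phase: group the mapper output by key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)

    # Reduce phase: one reducer call per distinct key, in sorted key order.
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


lines = [(0, "big data is big"), (1, "data moves slowly")]
word_counts = run_mapreduce(lines, mapper, reducer)
```

Here `word_counts` holds one `(word, count)` pair per distinct word, for example `("big", 2)`.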

What's the benefit of distributed computing?

Parallel computing; divide and conquer; moving the process to the data instead of moving the data, which is important in big data processing because moving data to the process is too expensive.

What is database sharding?

It is splitting data into smaller, manageable pieces and hosting each piece on cheap commodity hardware. It's a shared-nothing architecture, and it is usually natively supported in NoSQL databases.

It is a form of horizontal partitioning.

SQL Server has database federation (in SQL Azure). Oracle has no native support so far, so partitioning is probably the best bet in Oracle.

Or, from the web:
Sharding is a type of database partitioning that separates very large databases into smaller, faster, more easily managed parts called data shards. The word shard means a small part of a whole.
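
As a rough illustration of how rows end up on different shards, here is a hash-based routing sketch in Python. Everything in it is an assumption for illustration: `ShardedStore` and `shard_for` are hypothetical names, each shard is just a dictionary standing in for a separate database server, and real systems often route by range or with consistent hashing instead.

```python
import hashlib


def shard_for(key, shard_count):
    """Pick a shard from a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


class ShardedStore:
    """Each shard owns its own data and shares nothing with the others."""

    def __init__(self, shard_count):
        self.shards = [dict() for _ in range(shard_count)]

    def put(self, key, value):
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key):
        return self.shards[shard_for(key, len(self.shards))].get(key)


store = ShardedStore(shard_count=3)
for user_id in ["u1", "u2", "u3", "u4", "u5", "u6"]:
    store.put(user_id, {"id": user_id})
```

A lookup such as `store.get("u4")` touches only the one shard that owns the key, which is what lets each shard live on its own modest machine.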

What is partitioning?

From Oracle:
Partitioning allows a table, index, or index-organized table to be subdivided into smaller pieces, where each piece of such a database object is called a partition. Each partition has its own name, and may optionally have its own storage characteristics.

From the perspective of a database administrator, a partitioned object has multiple pieces that can be managed either collectively or individually. This gives the administrator considerable flexibility in managing partitioned objects. However, from the perspective of the application, a partitioned table is identical to a non-partitioned table;

It's horizontally dividing data into smaller pieces within one database.
Usually, automatically scheduled jobs need to run to keep applying the partitioning function as time goes on.

What is the difference between ArrayList and LinkedList?

One is a list implementation based on an array; the other is a list implementation based on linked nodes. An array boasts instant random access, but it is bad at insert and delete because the rest of the list has to be shifted in memory. It also wastes memory when a newly extended array list holds only a few elements in the newly allocated memory. A linked list occupies more memory per element because of its more complex underlying data structure, and it is slower at accessing an element because it has to traverse the list to find the target. But it is good at insert and delete operations. So use them for different situations: if your data is relatively static and random access is important, use ArrayList; if the data is dynamic and insert and delete operations dominate, use LinkedList.
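
The trade-off can be shown with a small Python sketch. It is not Java's `ArrayList` or `LinkedList`: Python's built-in `list` stands in for the array-backed list, and the `Node` and `LinkedList` classes are hypothetical minimal versions written only for this comparison.

```python
class Node:
    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node


class LinkedList:
    """Singly linked list: cheap insert at the front, slow access by index."""

    def __init__(self):
        self.head = None

    def insert_front(self, value):
        # O(1): only the head pointer changes, nothing is shifted.
        self.head = Node(value, self.head)

    def get(self, index):
        # O(n): walk node by node until the index is reached.
        if index < 0:
            raise IndexError(index)
        node = self.head
        for _ in range(index):
            if node is None:
                break
            node = node.next
        if node is None:
            raise IndexError(index)
        return node.value

    def to_list(self):
        values, node = [], self.head
        while node is not None:
            values.append(node.value)
            node = node.next
        return values


# Array-backed list: indexing is O(1), inserting at the front shifts every element.
array_list = [10, 20, 30]
array_list.insert(0, 5)
third_in_array = array_list[2]

linked = LinkedList()
for value in (30, 20, 10):
    linked.insert_front(value)
linked.insert_front(5)
third_in_linked = linked.get(2)
```

Both structures end up holding the same values in the same order; what differs is which operation had to do the extra work.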

What is a static variable?

It's also called a class variable. No matter how many instances you create from a class, there is only one copy of the variable, shared by all the instances.
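
A short Python sketch of the same idea, where a class attribute plays the role of a static variable. The `Connection` class and its `open_count` attribute are hypothetical names chosen for illustration.

```python
class Connection:
    # Class variable: one copy shared by every instance.
    open_count = 0

    def __init__(self, name):
        self.name = name              # instance variable: one copy per object
        Connection.open_count += 1    # update the shared copy through the class


first = Connection("first")
second = Connection("second")
```

After the two constructor calls, `Connection.open_count`, `first.open_count` and `second.open_count` all read the same shared value, while `first.name` and `second.name` differ.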

What is a set operation?

An operation between sets, for example union, union all, intersect, and minus/except.
Oracle also provides similar operators for collections:
multiset except|intersect|union
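
To make the difference between the operators concrete, here is a small Python sketch that mimics them on two lists of keys. It only mirrors the SQL semantics (duplicates removed everywhere except in union all); the variable names are illustrative.

```python
a = [1, 2, 2, 3]
b = [2, 3, 4]

union = sorted(set(a) | set(b))      # distinct values from either side
union_all = a + b                    # everything from both sides, duplicates kept
intersect = sorted(set(a) & set(b))  # distinct values present on both sides
minus = sorted(set(a) - set(b))      # distinct values in a that are not in b
```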

What is a correlated subquery?

It's a subquery that uses values from the outer query. The correlated subquery is evaluated once for each row of the outer query, so performance might be an issue if the outer query's data set is big.

How to optimize it?

Convert it to some form of query that is not correlated, such as a view, a plain subquery, or a CTE (subquery factoring), and then join to it instead of making it part of the where condition. It might calculate more data, but it runs only once.
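
The rewrite can be tried out with a small sketch. It uses Python's built-in sqlite3 module rather than Oracle so that it runs anywhere, and the `emp` table, its columns and its rows are made up for illustration; how much the rewrite helps in practice depends on the optimizer and the data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table emp (name text, dept text, salary integer)")
conn.executemany(
    "insert into emp values (?, ?, ?)",
    [("ann", "it", 100), ("bob", "it", 60), ("cat", "hr", 50), ("dan", "hr", 70)],
)

# Correlated form: the inner query refers to e.dept from the outer row.
correlated_sql = """
select e.name
from emp e
where e.salary > (select avg(x.salary) from emp x where x.dept = e.dept)
order by e.name
"""

# Rewritten form: compute every department average once in a CTE, then join.
joined_sql = """
with dept_avg as (
    select dept, avg(salary) as avg_salary from emp group by dept
)
select e.name
from emp e
join dept_avg d on d.dept = e.dept
where e.salary > d.avg_salary
order by e.name
"""

correlated_rows = [row[0] for row in conn.execute(correlated_sql)]
joined_rows = [row[0] for row in conn.execute(joined_sql)]
```

Both queries list the employees who earn more than their department's average, so the two result lists are identical.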

Tell me about your experience with Hadoop

I set up a 3-node Hadoop cluster to demonstrate the Hadoop ecosystem (MapReduce, Pig, Hive, ZooKeeper, HBase, etc.) to Langara's IT department.

I also took part in IBM's big data hacking session to learn its Bluemix cloud solution and its BigInsights platform, which is based on Hadoop.

What are inner join and outer join?

What will happen if an outer join has duplication in one table? Say there are two rows in the left table and one row with the same key in the right table: what will the result be for a left join, a right join and a full join?

An inner join returns the rows that match on the join criteria in both tables.
An outer join also returns the rows that have no match in the other table.

All three cases return the same result because there is a match on both sides; the duplication is simply reflected in the result. All of them return 2 rows. (I gave the wrong answer on the right join.)

with
q1 as
(
select '1' as k1,'first row from q1' as des from dual
union all
select '1' as k1,'second row from q1' as des from dual
),
q2 as
(
select '1' as k1,'first row from q2' as des from dual
)
select * from q1 full outer join q2 on q1.k1=q2.k1;
--select * from q1 left outer join q2 on q1.k1=q2.k1;
--select * from q1 right outer join q2 on q1.k1=q2.k1;

If there are no matching keys, the left join returns 2 rows, the right join returns 1 row and the full join returns 3 rows.
with
q1 as
(
select '1' as k1,'first row from q1' as des from dual
union all
select '1' as k1,'second row from q1' as des from dual
),
q2 as
(
select '2' as k1,'first row from q2' as des from dual
)
select * from q1 full outer join q2 on q1.k1=q2.k1;
--select * from q1 left outer join q2 on q1.k1=q2.k1;
--select * from q1 right outer join q2 on q1.k1=q2.k1;
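
The row counts in both cases can also be checked without a database. The following Python sketch is a naive nested-loop join written only for illustration; the `join` function and its `how` parameter are hypothetical, and the rows mirror q1 and q2 from the queries above.

```python
def join(left, right, how="inner"):
    """Join two lists of (key, description) rows on the key.

    `how` is one of "inner", "left", "right" or "full".
    Unmatched sides are represented by None.
    """
    rows = []
    matched_right = set()
    for left_row in left:
        hits = [(i, r) for i, r in enumerate(right) if r[0] == left_row[0]]
        for i, right_row in hits:
            rows.append((left_row, right_row))
            matched_right.add(i)
        if not hits and how in ("left", "full"):
            rows.append((left_row, None))       # left row with no partner
    if how in ("right", "full"):
        for i, right_row in enumerate(right):
            if i not in matched_right:
                rows.append((None, right_row))  # right row with no partner
    return rows


q1 = [("1", "first row from q1"), ("1", "second row from q1")]
q2_matching = [("1", "first row from q2")]
q2_not_matching = [("2", "first row from q2")]

counts_matching = {how: len(join(q1, q2_matching, how))
                   for how in ("left", "right", "full")}
counts_not_matching = {how: len(join(q1, q2_not_matching, how))
                       for how in ("left", "right", "full")}
```

With matching keys every join type yields 2 rows; without a match the counts are 2, 1 and 3 for left, right and full.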

What's the benefit of distributed processing?

My mindset was not switched to the right channel, but afterward I thought he was probably expecting to hear these:
--divide and conquer
--distributed workload
--parallel processing
--moving process instead of moving data
--cheap computers delivering supercomputing power
--efficient use of computing power
--scaling out more easily and flexibly than a centralized computing model

What do you keep in mind while designing a service?

This can be answered with the SOA service-orientation principles:

standardized service contract
loose coupling
abstraction
reusability
autonomy
statelessness
discoverability
composability

What do you do to deliver availability?

Availability basically means clustered services. To be clustered together, a service has to be designed and implemented with these in mind:
autonomy
statelessness

What is scalability?

Scalability is basically a way to meet changing demand. It's better described as elasticity, meaning it scales up or down. In order to scale a service, it should be able to cluster together, so its design should keep these in mind:
autonomy
statelessness
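
Why statelessness makes clustering possible can be sketched in a few lines of Python. This is only an illustration under loud assumptions: `SharedStore` stands in for an external database or cache, `CounterService` is a hypothetical service node, the round-robin loop stands in for a load balancer, and nothing here handles concurrent access.

```python
class SharedStore:
    """Stand-in for an external store shared by all nodes."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=0):
        return self._data.get(key, default)

    def put(self, key, value):
        self._data[key] = value


class CounterService:
    """A stateless node: it keeps no per-client state between calls."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def increment(self, client_id):
        value = self.store.get(client_id) + 1
        self.store.put(client_id, value)
        return value


store = SharedStore()
nodes = [CounterService("node-a", store), CounterService("node-b", store)]

# Round-robin the same client's requests across the nodes.
results = [nodes[i % len(nodes)].increment("client-1") for i in range(4)]
```

Because neither node remembers the client, any node can serve any request, and nodes can be added or removed without losing the running count.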

Friday, May 22, 2015
