Twenty Percent Knowledge: 2015

Tuesday, October 20, 2015

Using XPath to find Maximum and Minimum values

XPath for minimum

/ama/e/@v[not(. > ../../e/@v)]

XPath for Maximum

/ama/e/@v[not(. < ../../e/@v)]

Monday, October 19, 2015

Note on Anomaly detection

https://en.wikipedia.org/wiki/Anomaly_detection

In data mining, anomaly detection (or outlier detection) is the identification of items, events or observations which do not conform to an expected pattern or other items in a dataset.

Three broad categories of anomaly detection techniques exist. Unsupervised anomaly detection techniques detect anomalies in an unlabeled test data set under the assumption that the majority of the instances in the data set are normal by looking for instances that seem to fit least to the remainder of the data set. Supervised anomaly detection techniques require a data set that has been labeled as "normal" and "abnormal" and involves training a classifier (the key difference to many other statistical classification problems is the inherent unbalanced nature of outlier detection). Semi-supervised anomaly detection techniques construct a model representing normal behavior from a given normal training data set, and then testing the likelihood of a test instance to be generated by the learnt model.

Friday, October 02, 2015

Convert Putty key to SSH key

assuming you already have private key that was generated by puttygen, do following steps to import the key to ssh.

1. load the private key to puttygen with it's load function
2. in its conversions menu, choose export openssh. make sure format is ssh-2 rsa.
3. save the key to ~/.ssh/id_isa
3. with ssh-keygen, do
ssh-keygen -y -f ~/.ssh/id_rsa > ~/.ssh/id_rsa.pub

Thursday, July 23, 2015

Some Basic Ideas

000

the nature of the data structure has a direct impact on the efficiency of the
algorithms

data structure conversion

Very often, in order to achieve better access time, one data structure need to be converted to another data structure that naturally support better access time. E.g. converting string to array or hashtable to convert its elements access time from O(n) to O(1). Conversion on data structure implies more space is required in memory.

Pointers or moving cursors

They are the key to in-place operation. often, one is to traversal the input, another is to the newly built result. for example, for in-place character removal in a string, destination cursor points to position that will be eventually returned back in new string, while source cursor is to traversal the original string character by character to make sure every characters are processed.

Other situations includes head pointer and tail pointer and they move towards middle, as in example to reverse a ward or sorted array, or to determine is a string is palindrome.

void reverseString( char str[], int start, int end ){
char temp;
while( end > start ){
/* Exchange characters */
temp = str[start];
str[start] = str[end];
str[end] = temp;
/* Move indices towards middle */
start++; end--;
}
}

Another obvious use of two pointers is merging two sorted arrays.

Binary search

it's used widely when finding an element in a sorted sequence.
middle point is just a concept of a point in between start and end, it does not has to be in the middle of start and end. A problem to find string in sorted string array with interspersed empty strings ss a good demo to this.

Hashtable vs Array

Hashtable has extra overhead which array does not, but array need to be predefined on its capacity, while hashtable only for content actually in the input. For vrey long string, Array can be a better choice since total number of distinct characters are predefined and likely the very long string will contain every character. Choice between them is situation by situation.

Array is a contiguous block of memory, while linked list is sparsely distributed. inserting or deleting an element in array requires every elements after it either be shifted left or right or even totally relocate to different memory area. while linked list just need to change a node's linkage. But traversal of linkedlist is linear and worse can be N if the element is at tail of singly linked list.

Recursion

All recursion can be converted to iteration.
it is most useful for tasks that can be defined in terms of similar subtasks. the key is to identify out relationship between whole and subtasks. Tree problems can typically be resolved by recursion because tree is composed of subtrees.

Base case: when to stop, when the subroutine can finish without calling itself
Recursive cases: the sub problems have not been resolved.

When converting recursive to iterative routine, usually start from base case, and build up the whole solution from bottom up. Recursive calls are generally used to preserve the current values of local variables and restore them when the subtask performed by the recursive call is completed.Since recursively calls are using stack to reserve the local variables, you can also use stack in iterative calls to mimic the recursive call behavior, but then very often, it is about more trouble than it’s worth.

with help of memorization, many recursive solution become more efficient by avoiding duplicated calculations, and can be called DP: dynamite programming.

It's common to prepare some parameter in another routine, and call the recursive routine from there.

Iteration

When converting recursion to iteration, think it either bottom up or top down. and also thinking about using a queue like structure to hold the intermediate results for further process until result is good.
Typical example would be using queue in tree's traversal, and string permutation example I gave in another article.

Stack and Queue

using stack to reverse order or access.(LIFO)
using queue to keep order or access.(FIFO)

String & Array

1. find first non-repeated character

--using a array indexed by character(128 for ASCII)
--using hash table/map if potential characters are in huge amount, do first loop to build up hashmap, do second loop to find first character that repeats only once
--to find last non-repeated, looping back or resort to a data structure like LinkedHashmap that keeps the insertion order(you can make onekey-value pair with equals and hashcode same as it's key's behavior)

Tree

BFT

BFT is used for level by level process, usually resorting to a queue to add a node, remove it, process it, and add its children to queue from left to right. keep processing it until queue is empty.

The queue here is changing constantly by removing a node adding its children to tail of queue.

Levels

usually you do not need to deal with level, if you do need to figure out level you are in, Another queue for serving queue of all children would be helpful. this case the normal queue mentioned above would serve for current level. while processing current level, when the current queue turns empty, it will add children back in, and clear out children queue. level will increase when this happens.

LinkedList is a queue in java.

 public static int getLevels(TreeNode aTree){
  if (aTree == null)return 0;
  int levels = 0;
  Queue<TreeNode> curLevel = new LinkedList<TreeNode>();
  Queue<TreeNode> nextLevel = new LinkedList<TreeNode>();
  //starting from root, level 0 will become to 1 after root processing
  curLevel.add(aTree);
  while(!curLevel.isEmpty()){
   TreeNode curNode = curLevel.remove();
   if (curNode.left!=null)
    nextLevel.add(curNode.left);
   if (curNode.right!=null)
    nextLevel.add(curNode.right);
   if(curLevel.isEmpty()){
    if (nextLevel.isEmpty())break;
    else{
     levels++;
     curLevel.addAll(nextLevel);
     nextLevel.clear();
    }
   }
  }
  return levels;
 }

Lowest Common Ancestor

If n1 is in left tree, and n2 is in right tree, then root is the lowest common node.
if n1 and n2 are both in left tree, then recursively start from left node; if n1 and n2 are both in right tree, then recursively start from right child node.

Complete Binary Tree

is a binary tree in which every level, except possibly the last, is completely filled, and all nodes are as far left as possible.

A Binary Heap is a Complete Binary Tree where items are stored in a special order such that value in a parent node is greater(or smaller) than the values in its two children nodes. The former is called as max heap and the latter is called min heap

Since a Binary Heap is a Complete Binary Tree, it can be easily represented as array and array based representation is space efficient. If the parent node is stored at index I, the left child can be calculated by 2 * I + 1 and right child by 2 * I + 2.

Number to String

negative can be kept in flag variable. convert number to positive the first.

build it up from right to left.
n%10+'0' is character of a number.
next number to process is n/10
loop it until num=0

String to number

From left to right
charAt(i)-'0' is the real number n
n=n*10+(charAt(i)-'o')

Binary Heap

It is a complete binary tree(so that you can use math to determine the relationships between nodes).
all levels are full except bottom level. bottom level should be filled from left to right.

it is an implementation of priority queue.

(Binary)Heap is a structure a parent is no less or no greater than children.
It's conveniently implemented by array. if making no use of a[0](starting from a[1]), then for a node i, its left child is located at i*2 and right child is at i*2+1.It's parent is (i-1)/2. if making use of a[0], then its left child is i*2+1, right child is i*2+2, its parent is (i-1)/2.

Priority Queue

minimum and maximum priority queue. basic operation is constant time to identify and delete min or max, based on type of queue, element in the queue.

While adding element to priority queue, it sorts the elements to make sure minimum element or maximum element can be constantly accessed from get or delete operation. it does not care if rest of elements are sorted, it only makes sure get highest prioritized element in constant time.

Priority Queue implemented by Binary Heap based on Array
1. delete. delete the top, move end to top and sink down to correct position(restore heap)
2. add.add to the end and then swim/float up to correct position (restore heap)

Element is actually often (k,v)
Typical problem to solve: scheduling, find most.. least..
An impressive use is to merge K sorted linked list.

Fibonacci Sequence Like Problems

Third number is sum of first two numbers. ladder climbing, staircase climbing, fabonacci, tree hights etc. they can be expressed as f(n)=f(n-1)+f(n-2), but with different base cases. recursion is naturally way to deal with them, but due to efficiency concern, most of them need to be solved in iterative ways. DP is specifically good at resolving this kind of problem by using extra data structure to remember the calculated result so that it won't repeated/duplicated calculated. for DP, you can do from top-down or bottom-up, one of the way should be easier, cleaner and more efficient.

Ladder/staircase climbing: climb one or two steps a time.

I changed the problem to be from a bag of n items, either taking 1 or 2 items out a time, how many
different ways you can use to take items out of bag?

public static int climbStairs(int n) {
 if(n==1)return 1;
 if(n==2)return 2;
 return climbStairs(n-1)+climbStairs(n-2);
}
public static int climbStairsIterative(int n) {
 if(n==1)return 1;
 if(n==2)return 2;
 int m1=1,m2=2,tmp;
 
 for(int i=3;i<=n;i++){
  tmp =m1 + m2;
  m1=m2;
  m2=tmp;
 }
 return m2;
}

Binomial theorem

 $(x+y)^n = {n \choose 0}x^n y^0 + {n \choose 1}x^{n-1}y^1 + {n \choose 2}x^{n-2}y^2 + \cdots + {n \choose n-1}x^1 y^{n-1} + {n \choose n}x^0 y^n,$

Saturday, July 18, 2015

Longest Substring with At Most Two Distinct Characters

Problem

Given a string S, find the length of the longest substring T that contains at most two
distinct characters.
For example,
Given S = “eceba”,
T is "ece" which its length is 3.

Analysis

To me, this is similar to find length of longest substring without repeated characters, and it is even easier. not sure why it's marked one level harder than it.

With a sliding window, try to dynamically determine valid substring's head and tail. In this problem, it's easier and naturally to adjust only the header pointer since when tail meets third distinct character, that character will become part of next substring, so as well the third distinct character's left adjacent neighbors.

When finding the third distinct character, try to adjust header to the left most character of left adjacent character. e.g. aabbbcc, when meeting first c, both c and its left bs will be part of next sub string, including all adjacent cs.

I still use set as a generic approach to determine uniqueness of characters. array with char as index can also be used, that's smart but has constraints.

Solution

public static int longestSubStringAtMostTwoDistinctChar(String s){
 if(s==null)return -1;
 if(s.isEmpty())return 0;
 //this set contains characters that should be in current substring
 Set<Character> charSet = new HashSet<Character>();
 int maxLen=0,h=0,t=h;
 
 
 //having two pointers, one for head adjustment, one for substring tail adjustment
 //end of substring is marked by first character that makes set size of 3
 //
 while (h<s.length()){
  //inner loop stops while t pointing to third different character
  //this third character will become part of next substring
  //t only grows from 0 to n, when it reaches, process ends, 
  //so this is O(n) time complexity
  while(t<s.length() ){
   charSet.add(s.charAt(t));
   //after adding a character, we need check its size to tell 
   //if this is a third character
   if(charSet.size()>2)break;
   t++;
  }
  //now the tail is at position after the end of substring
  maxLen = Math.max(maxLen, t-h);
  //if tail is already at the end of string, then end the process
  if(t==s.length())return maxLen;
  //now adjust head pointer
  //e.g. abaac, it is actually easier to find the head position for next string
  //from tail to head. moving cursor from c back to a,a until it hits b, new start
  //s from a
  
  //then moving it towards head until it meets a different characer
  //adjacent character will be part of next substring
  //adjust h to t-1 the first sinnce t is now position after current substring
  h=t-1;
  while(h>0 && s.charAt(h)==s.charAt(h-1)){
   h--;

  }
  //when loop ends, h is either 0 or its previous character is different
  
  //remove the character from set
  //it loops back to here because there must be one character need to be removed
  charSet.remove(s.charAt(h==0?0:h-1));
  
  
   
  
 }
 return maxLen;
}

Friday, July 17, 2015

Longest Substring Without Repeating Characters

Problem:

Given a string, find the length of the longest substring without repeating characters. For example, the longest substring without repeating letters for "abcabcbb" is "abc", which the length is 3. For "bbbbb" the longest substring is "b", with the length of 1.

Analysis

Brute force solution would be starting from each position, and use an inner loop to find longest substring without repeated character. As a normal way, I use set to hold characters already visited. (even though it's smart to use array indexed with character, set is more generic approach) and it builds up a substring from start point to position before the first repeat of character. During the process, maximum length is tracked and will be returned at last.

This will work but very inefficient since it spend most of time working on rebuilding the shorter substrings that have been built. But this approach is a good start for people to understand how it works and why other ways are better. Basic understanding will also build up a base for further smart thought.

Solutions

import java.util.HashSet;
import java.util.Set;

public class longestNoneRepeat {

 public static void main(String[] args) {
  System.out.println(lengthOfLongestSubstringBruteForceModified("abcdefabcfabcdefhijka"));

 }


 /**
  * starting from a position, using inner loop to find tail still using set
  * to tell if character is repeated, but at start of each outter loop, set
  * need to be cleared Note: this is working,but inefficient. time limit
  * exceeded is what being reported by leetcode online judge.
  * 
  * @param s
  * @return
  */
 public static int lengthOfLongestSubstringBruteForce(String s) {
  Set<Character> charSet = new HashSet<Character>();
  int i = 0, maxLen = 0;
  for (int j = 0; j < s.length(); j++) {
   charSet.clear();
   charSet.add(s.charAt(j));
   i = j + 1;
   while (i < s.length() && !charSet.contains(s.charAt(i))) {
    charSet.add(s.charAt(i));
    i++;
   }

   // maxLen = Math.max(i-j, maxLen);
   maxLen = Math.max(charSet.size(), maxLen);
  }
  return maxLen;
 }

 /**
  * to avoid some of extra useless work. this will be better than brute
  * force, but extra loop of inner loop is still there. we need it do not
  * start over
  * 
  * @param s
  * @return
  */
 public static int lengthOfLongestSubstringBruteForceModified(String s) {
  Set<Character> charSet = new HashSet<Character>();
  int j = 0, i = 0, maxLen = 0;
  while (j < s.length()) {
   charSet.clear();
   charSet.add(s.charAt(j));
   i = j + 1;
   while (i < s.length() && !charSet.contains(s.charAt(i))) {
    charSet.add(s.charAt(i));
    i++;
   }

   // maxLen = Math.max(i-j, maxLen);
   maxLen = Math.max(charSet.size(), maxLen);
   // reset j, j should be a position after the character being
   // repeated
   if (i == s.length())
    break;
   while (s.charAt(j) != s.charAt(i))
    j++;
   j++;
  }
  return maxLen;
 }

 /**
  * accepted
  * 
  * @param s
  * @return
  */
 public static int lengthOfLongestSubstringBruteForceModified2(String s) {
  Set<Character> charSet = new HashSet<Character>();
  int j = 0, i = 0, maxLen = 0;
  while (j < s.length()) {

   charSet.add(s.charAt(j));
   i = j + charSet.size();
   while (i < s.length() && !charSet.contains(s.charAt(i))) {
    charSet.add(s.charAt(i));
    i++;
   }

   // maxLen = Math.max(i-j, maxLen);
   maxLen = Math.max(charSet.size(), maxLen);
   // reset j, j should be a position after the character being
   // repeated
   if (i == s.length())
    break;
   // adjust header, remove left elements to the repeated character
   // leave the repeated character in the set because it should be in
   // next substring
   while (s.charAt(j) != s.charAt(i)) {
    charSet.remove(s.charAt(j));
    j++;
   }
   // move head to next position of the repeated character
   j++;
  }
  return maxLen;
 }
 /**
  * abcdefabcfabcdefhijka 1. The way to remember occurrence of
  * character/element is to use a character array or set/hashtable as long as
  * they provide constant time to access 2. repeated or not is only
  * meaningful to current substring that is worked on 3. if at position j it
  * has a repeat, it must be repeating a character in i. how to find i? 4.
  * when find a duplication, it means the current substring is done, try to
  * modify the max length and remove the characters from set that are no
  * longer required in next substring's building up 5.how to find current
  * substring's length: j-i+1 6. how to find position of i: 6.1 with
  * character array: a[j]=a[i] because they are the repeated character which
  * is array's index. since all valid positions (building up substring
  * without repeated character) left to i can not build up a substring longer
  * than current substring, the new position for next substring is from i+1,
  * j can continue to move forward. when next repeated character is find, the
  * next substring's length is j-i+1 6.2 with set from i until character
  * being repeated, removing from set, this includes the repeated character.
  * but since repeated character at j is part on next substring, so that we
  * put it back to set later on 7. technique of innter loop: it uses cleaning
  * up status or removing from set to control how i is found 8. i is head
  * position of substring, j is tail position of substring
  * 
  * @param s
  * @return
  */
 public static int lengthOfLongestSubstringSet(String s) {
  Set<Character> charSet = new HashSet<Character>();
  int i = 0, maxLen = 0;
  for (int j = 0; j < s.length(); j++) {
   while (charSet.contains(s.charAt(j))) {
    charSet.remove(s.charAt(i));
    i++;
   }
   charSet.add(s.charAt(j));
   maxLen = Math.max(j - i + 1, maxLen);
  }
  return maxLen;
 }

}

Sunday, July 12, 2015

Get Lowest Common Ancester of Two Nodes in A Tree

if node 1 is in left tree and node 2 is in right tree, then root is the lowest common ancestor

if both node1 and node 2 are in left tree, then do same recursion to left tree, until lowest ancestor is found

Same logic is applied when node 1 in right tree and node 2 in left tree...

A helper function is required to tell if a node is in a tree.

   public static TreeNode getLastCommonParentRec(TreeNode root, TreeNode n1, TreeNode n2) {
        if (findNodeRec(root.left, n1)) {
            if (findNodeRec(root.right, n2)) {
                return root;
            } else {
                return getLastCommonParentRec(root.left, n1, n2);
            }
        } else {
            if (findNodeRec(root.left, n2)) {
                return root;
            } else {
                return getLastCommonParentRec(root.right, n1, n2);
            }
        }
    }

public static boolean findNodeRec(TreeNode root, TreeNode node) {
        if (root == null || node == null) {
            return false;
        }
        //root the first, this is preorder traversal
        if (root == node) {
            return true;
        }

        // find in left tree the first
        boolean found = findNodeRec(root.left, node);
        if (!found) {       // if it is not in left tree, then find it in right tree
            found = findNodeRec(root.right, node);
        }
        return found;
    }

Build Binary Tree from Array Traversal Result

From a tree, you can traversal it with preorder, inorder and post order (DFT) or level by level. You can not restore a tree from only one of result without extra information, but you have chance to restore it with two types of result.

For example, most common one is to restore a binary tree from preorder and inorder results.

In preorder, first element is the root, then entire left tree and then entire right tree.
In inorder, root is somewhere in the middle, left to it is entire left tree, right to it is entire right tree. you can use root from preorder to locate its position in an inorder array.

And with root position in in-order array, number of elements in left tree can be figured out, with this information, boundary of preorder's left and right subtrees can be extracted out.

Since each node is also a tree, then recursively applying this algorithm, it will be able to find children for all parents.

Preorder and post order won't work, because it can not find out root.

public static TreeNode rebuildBinaryTreeRec(List<Integer> preOrder, List<Integer> inOrder){  
        TreeNode root = null;  
        List<Integer> leftPreOrder;  
        List<Integer> rightPreOrder;  
        List<Integer> leftInorder;  
        List<Integer> rightInorder;  
        int inorderPos;  
        int preorderPos;  
   
        if ((preOrder.size() != 0) && (inOrder.size() != 0))  
        {  
            // first element in preorder array is root  
            root = new TreeNode(preOrder.get(0));  
   
            //  Based upon the root information, manupulate preorder array to separate it into left and right subarrays
   // find root according to root in preorder
            inorderPos = inOrder.indexOf(preOrder.get(0));      
            leftInorder = inOrder.subList(0, inorderPos);  
            rightInorder = inOrder.subList(inorderPos + 1, inOrder.size());  
   //then according to in order sub tree information, help preorder to determine the left and right tree
            preorderPos = leftInorder.size();
            leftPreOrder = preOrder.subList(1, preorderPos + 1);  
            rightPreOrder = preOrder.subList(preorderPos + 1, preOrder.size());  
   //then root's left subtree is the subtree built from left preorder and left inorder subtrees
            root.left = rebuildBinaryTreeRec(leftPreOrder, leftInorder); 
   //root's right tree is the tree built from right part of preorder subtree and right part of in order subtree
            root.right = rebuildBinaryTreeRec(rightPreOrder, rightInorder);   
        }  
  //return the root of subtrees
        return root;  
    }