Twenty Percent Knowledge: May 2015

Friday, May 29, 2015

Analyzing Tomcat Log Files With LogStash, ElasticSearch and Kibana

Background

I have a plan to setup LogStash, ElasticSearch and Kibana to monitor and analyze log files from production tomcat servers. It was a smooth start to install and play around all these software. But then I was stuck for almost two days to figure out a way to parse the information from the log files. After tons of experiments, I finally understand them better and figured out how to analyze the log files from Tomcat.

Lessons learned

--even the popular open source products, their documents are still not complete and somewhat hard to understand, and many of solutions searched out are immature and useless.
--copying other people's examples is a good way to get started, but productivity is from fully understanding how the thing works.
--when learning, you can have a look at other people's complex work, but you'd better to start simple and build it up knowledge gradually so that you had less chance to spend time on wrong directions.

Solutions

There are two types of log files I want to analyze, one is having format like this:

[2015-05-15 13:03:15,999] WARN [thread-1-1] some message 1111
part of first message
third row of first message
[2015-05-15 13:04:15,999] WARN [thread-1-2] some message 22222

Another has format like this:

May 25, 2015 2:26:22 AM com.sun.xml.ws.server.sei.TieHandler createResponse
SEVERE: PreparedStatementCallback; uncategorized SQLException for SQL [xxxx]; SQL state [72000]; error code [20000]; ORA-20000: Concurrency Failure
ORA-06512: at "xxxx", line 10
ORA-04088: error during execution of trigger xxxx'

....

May 25, 2015 2:26:40 AM javaclass method
....

I decided to define two TCP port each type of log files so that processing can be logically separated.

This is the extra patterns I defined. They are put in a file called amaPatterns.

amaPatterns:

EVERYTHING (.*)
SSS (?:[\d]{3})
MYTIME %{YEAR}-%{MONTHNUM}-%{MONTHDAY} %{HOUR}:%{MINUTE}:%{SECOND},%{SSS}
CATALINA_DATESTAMP %{MONTH} %{MONTHDAY}, 20%{YEAR} %{HOUR}:?%{MINUTE}(?::?%{SECOND}) (?:AM|PM)
CATALINALOG %{CATALINA_DATESTAMP:logTime} %{JAVACLASS:class} %{NOTSPACE:methodName}\n%{LOGLEVEL:logLevel}: %{EVERYTHING:logMsg}

LogStash Configuration

This is the configure file for LogStash. It's defined in a file called first_config.cfg.

(TODO: provide complete solution for two types of log files)
first_config.cfg:
input {
#if elasticSearch has explicit name, then it should be used here
#I used stdin for testing purpose. real log files are fed to tcp port 666
#nc localhost 6666

tcp {
type => "catalina.out"
port => 6666
add_field => { "server" => "esb-java" fileSrc =>"cata.out" }
}
tcp {
type => "catalina.date"
port => 6667
add_field => { "server" => "esb-java" fileSrc =>"cata.date" }
}
#for debugging purpose
#when testing content from different tunnel, change type here correspondingly so you can
#use stdin as input for that branch of processing
# stdin {
# type => "catalina.date"
# }

}
#by default, every line is an event to logstash.
#multiline filter can combine lines/events into one event
#this definition means, for all the events that do not match given pattern,
#they should be merged to previous event
#in my case, a line start with [yyyy-mm-dd is a matching, all subsequent lines without matching will
#be merged into this matched line and dump out as one event
filter {
if [type] == "catalina.out" {
multiline {
patterns_dir => "/root/logstash-1.5.0/patterns"
pattern => "\[%{MYTIME}\]%{EVERYTHING}"
negate => true
what => "previous"
}
#this filter parse out logTime, logLevel and rest of content as logMsg
grok {
patterns_dir => "/root/logstash-1.5.0/patterns"
match => [ "message", "\[%{MYTIME:logTime}\] %{LOGLEVEL:logLevel} %{EVERYTHING:logMsg}" ]
}
#if "_grokparsefailure" in [tags] {
# grok {
# patterns_dir => "/root/logstash-1.5.0/patterns"
# match => [ "message", "%{CATALINALOG}" ]
# }
}
#filters are executed per sequence defined here
#last filter's output event will be next filter's input event--first input event is a line of data
#input event's fields can be further manipulated for untouched
#it's better to enable the filters one by one in order to make sure the up stream ones are
#working properly before moving to down stream ones
if [type] == "catalina.date" {
multiline {
patterns_dir => "/root/logstash-1.5.0/patterns"
pattern => "%{CATALINA_DATESTAMP}%{EVERYTHING}"
negate => true
what => "previous"
}

grok {
patterns_dir => "/root/logstash-1.5.0/patterns"
match => [ "message", "%{CATALINALOG}" ]
}
}

#for logTime, specifying its format here. Date filter will convert it to date type and
#assign this time as this event's event timestamp
#if you do not have this, event's timestamp will be the time the event is given to logStash
#by default, index name is ended with day of message's timestamp. so if the timestamp is mapped
#to past, an index named with past date will be created. Also, it does not matter how many input files
#you have, the index name will be only correlated to time stamps the event happened. e.g. one input #will end up with multiple indices suffixed with a date in the past.
date {
match => [ "logTime", "yyyy-MM-dd HH:mm:ss,SSS", "yyyy-MM-dd HH:mm:ss,SSS Z", "MMM dd, yyyy HH:mm:ss a" ]
}
}
output {
if [fileSrc] == "cata.out" {

elasticsearch {
host => localhost
cluster=>emaES
index => "logstash-cata-out-%{+YYYY.MM.dd}"
}
}
if [fileSrc] == "cata.date" {
elasticsearch {
host => localhost
cluster=>emaES
index => "logstash-cata-date-%{+YYYY.MM.dd}"
}

}

#stdout for testing purpose
# stdout { codec => rubydebug }
}

Sample Output

Sample output in the stdout after it works. This example is for type of catalina.out. First four lines are inputs. Please notice the @timestamp has been changed to time in the log rows.

[2015-05-15 13:03:15,999] WARN [thread-1-1] some message 1111
part of first message
third row of first message
[2015-05-15 13:04:15,999] WARN [thread-1-2] some message 22222
{
       "message" => "[2015-05-15 13:03:15,999] WARN [thread-1-1] some message 1111\npart of first message\nthird row of first message",
      "@version" => "1",
    "@timestamp" => "2015-05-15T20:03:15.999Z",
          "host" => "hadoop1.hadooptest",
          "tags" => [
        [0] "multiline"
    ],
       "logTime" => "2015-05-15 13:03:15,999",
      "logLevel" => "WARN",
        "logMsg" => " [thread-1-1] some message 1111\npart of first message\nthird row of first message"
}
{
       "message" => "[2015-05-15 13:04:15,999] WARN [thread-1-2] some message 22222",
      "@version" => "1",
    "@timestamp" => "2015-05-15T20:04:15.999Z",
          "host" => "hadoop1.hadooptest",
       "logTime" => "2015-05-15 13:04:15,999",
      "logLevel" => "WARN",
        "logMsg" => " [thread-1-2] some message 22222"
}

Loading log files to ElasticSearch through LogStash

For catalina.date files, do this

nc localhost 6667
For catalina.out, do this
nc localhost 6667

in screenshot below, you can see the index name is automatically assigned according to date in event time stamp.

Visualizing and Analyzing with Kibana

Tuesday, May 26, 2015

Start to Learn ElasticSearch/Elastic and Kibana

This will be my task for next couple weeks.

ElasticSearch

Elastic is based on Apache Lucene, trying to provide simple RESTful interface for people to use complex Lucene. Beside that, it is also a
--key value document store where every field is indexed and searchable
--distributed search engine with real-time analytic

(Elasticsearch is a distributed document store. It can store and retrieve complex data structures—serialized as JSON documents—in real time. In other words, as soon as a document has been stored in Elasticsearch, it can be retrieved from any node in the cluster.)

--document in elastic search is data serialized in JSON. elastic search will convert data to json before storing it.

it uses structured JSON document.
(JSON is a way of representing objects in human-readable text. It has become the de facto standard for exchanging data in the NoSQL world. When an object has been serialized into JSON, it is known as a JSON document.)

Installation

1. Download 1.5.2 and extract it. impressed that it is only 30mb.
2. ./elasticsearch -XX:-UseSuperWord to start it due to my jdk is 1.7 and I do not want to destroy the hadoop environment that is built upon this Java environment.
3. curl -X get http://localhost:9200
I got this response. I've been listening to See You Again a lot these days, Wiz kid kind of making an impression that the Kid is smoking.

{
"status" : 200,
"name" : "Wiz Kid",
"cluster_name" : "elasticsearch",
"version" : {
"number" : "1.5.2",
"build_hash" : "62ff9868b4c8a0c45860bebb259e21980778ab1c",
"build_timestamp" : "2015-04-27T09:21:06Z",
"build_snapshot" : false,
"lucene_version" : "4.10.4"
},
"tagline" : "You Know, for Search"
}

4. shut it down

ctl+C

curl -XPOST 'http://localhost:9200/_shutdown'

Installing Marvel

Marvel

ElasticSearch's monitoring tool. free for development.

./bin/plugin -i elasticsearch/marvel/latest

Marvel Dashboard

http://localhost:9200/_plugin/marvel/

Marvel Sense: an interactive tool to communicate with elasticSearch

http://localhost:9200/_plugin/marvel/sense/

Talking to ElasticSearch

Basically two ways.

1. Java API using the native Elasticsearch transport protocol via port 9300.

1.1 using a non-data node

1.2 using a transport client

2. RESTful API with JSON via 9200. this is http protocol so that you can use curl to talk with it.

curl -X ':///?' -d ''

verb: GET, POST, PUT, HEAD, or DELETE

(

PUT: store this object at this URL

POST:store this object under this URL

)

protocol: htp or https

port: default elastic search's port is 9200

QUERY_STRING: optional parameters

body: A JSON-encoded request body (if the request needs one.)

for example, the following command display response in pretty format for easy reading and include http response header by tell curl it by -i.

curl -i -XGET 'localhost:9200/_count?pretty' -d '

{

"query": {

"match_all": {}

}

Try it out

basic but very useful information is here:

https://www.elastic.co/guide/en/elasticsearch/guide/current/_indexing_employee_documents.html

This is put here for enhancing memory.

Relational DB ⇒ Databases ⇒ Tables ⇒ Rows ⇒ Columns

Elasticsearch ⇒ Indices ⇒ Types ⇒ Documents ⇒ Fields

Had a taste on creating index and search it from simple to complex.

Learnt how its clustering works.

My experiments

1. retrieving document from Oracle

set termout off

set feedback off

set header off

set pagesize 0

set newpage none

spool spriden_es.shl

select 'curl -XPUT ''http://localhost:9200/personindex/person/'||spriden_id||''' -d ''

{

"spriden_id" : "'||SPRIDEN_ID||'",'||'

"spriden_last_name": "'||replace(SPRIDEN_LAST_NAME,'''','')||'",'||'

"spriden_first_name" : "'||replace(SPRIDEN_FIRST_NAME,'''','')||'",'||'

"spriden_mi" :"'||SPRIDEN_MI||'"'||'

}

'''

from spriden

where spriden_change_ind is null;

spool off

exit

2. chmod 755 spriden_es.shl and execute it

this will load half million documents to elasticSearch under index called personindex of type person.

This is very brutal and might crash elasticSearch. I will find out more efficient way of data loading after learning more about it.

3. a closer real world query

##I do not understand why this is working only on spriden_id, while changing id to be last name or first name, it's not working.

##it turns out working on lower case only. white space generally won't work and have to be specially treated.

curl -XGET 'http://localhost:9200/personindex/person/_search?pretty' -d '

{

"query":{

"term": {

"spriden_id": "mylangid"

}

#GET /personindex/person/_search

curl -XGET 'http://localhost:9200/personindex/person/_search?pretty' -d '

{

"query":{

"filtered": {

"filter": {

"bool": {

"must": [

{"term": {

"spriden_last_name": "ma"

}},

{"term": {

"spriden_id": "myLangId"

}

]

}

4.mapping to SQL

4.1Exact match. score is always 1

-->a=b

{"term":{"a":"b"}}

--< a in(1,2,..)

{

"terms" : {

"a" : [1, 2]

}

-->a=b and (c=d or e=f)

{

"bool" : {

"must" : [],

"should" : [],

"must_not" : [],

}

-->range

a between c and d

"range" : {

"a" : {

"gt" : c,

"lt" : d

}

4.2 full text query, score/relevance will be calculated and returned.

match

(I do not know why this did not work out anything by replacing match with term)

curl -XGET 'http://localhost:9200/personindex/person/_search?pretty' -d '

{

"query" : {

"match" : {

"spriden_first_name": "ShiJie"

}

##this one request to match all words

curl -XGET 'http://localhost:9200/personindex/person/_search?pretty' -d '

{

"query" : {

"match" : {

"spriden_first_name": {

"query":"Shi Jie",

"operator": "and"

}

or you can do this:

curl -XGET 'http://localhost:9200/personindex/person/_search?pretty' -d '

{

"query":{

"match_phrase" : {

"spriden_first_name": "Shi Jie"

}

Kibana

After starting up Kibana, I found the sample data is too simple, so that I decide to remove the index and rebuild it with more meaningful data. This time, beside person, person's birth date and their basic student information are also included.

Script for data generation and loading, only 9999 records are retrieved for experiment.

select 'curl -XPUT ''http://localhost:9200/personindex/person/'||spriden_id||''' -d ''

{

"spriden_id" : "'||SPRIDEN_ID||'",'||'

"spriden_last_name": "'||SPRIDEN_LAST_NAME||'",'||'

"spriden_first_name" : "'||replace(SPRIDEN_FIRST_NAME,'''','''''')||'",'||'

"spriden_mi" :"'||SPRIDEN_MI||'",'||'

"term_registered" :"'||SGBSTDN_TERM_CODE_EFF||'",'||'

"term_start_date" :"'||to_char(STVTERM_START_DATE,'yyyy-mm-dd hh24:mi:ss')||'",'||'

"program_registered" :"'||SGBSTDN_PROGRAM_1||'",'||'

"major_registered" :"'||SGBSTDN_MAJR_CODE_1||'",'||'

"stst" :"'||SGBSTDN_STST_CODE||'",'||'

"bday" :"'||to_char(SPBPERS_BIRTH_DATE,'yyyy-mm-dd hh24:mi:ss')||'"'||'

}

'''

from spriden join sgbstdn

on spriden_pidm=sgbstdn_pidm

join stvterm on STVTERM_CODE=SGBSTDN_TERM_CODE_EFF

join SPBPERS on spriden_pidm=SPBPERS_PIDM

where spriden_change_ind is null

and rownum<10000 div="">

;

Rebuild index

--define a type with date columns explicitly defined.

--define a type by putting one piece of data, get its mapping, modify it and reapply it

curl -XGET 'http://localhost:9200/personindex/_mapping/person?pretty'

curl -XDELETE 'http://localhost:9200/personindex'

curl -XPUT 'http://localhost:9200/personindex/'

curl -XPUT 'http://localhost:9200/personindex/_mapping/person' -d '

{

"person" : {

"properties" : {

"bday" : {

"type" : "date"

"major_registered" : {

"type" : "string"

"program_registered" : {

"type" : "string"

"spriden_first_name" : {

"type" : "string"

"spriden_id" : {

"type" : "string"

"spriden_last_name" : {

"type" : "string"

"spriden_mi" : {

"type" : "string"

"stst" : {

"type" : "string"

"term_registered" : {

"type" : "string"

"term_start_date" : {

"type" : "date"

}

--reload the index by running generated script

Define index in Hibana

Now after typing the index name, time fields are displayed.

Now I can play around the Kibana to see what it can do for me.

Plan for next is to use also lean logstash and build up a system to monitor tomcat web log on application errors.

Friday, May 22, 2015

Some Better Answers

Talking about project

what people mostly interested in is the impact of it.

briefly what it is, then emphasis the impact of it: good result

for example, the bigdata concept and process I delivered made IT department not only be aware of this technology, but also has confident when later other department request IT's support of big data related course designs.

Do you know what is slow changing dimension?

Dimension that changes overtime, but slowing. for example, a production's promotion status, a person's gender.

There are ways to handle the situation,
type 0 solution ignores the change as a solution.
type 1 overrides old value with new value. this solution lose history so you will have trouble when examining data in the past.
type 2 create new record of dimension, keeping a date range the value is in effect.
type 3 adding new column. I do not prefer to use because it can not capture all the changes
type 4 solution is to have historical table and define date range each record is in effect
type 6 is 1+2+3. this is no better than type 4.

Describe how map reduce works?

In programming, you typically provides two functions.one is called mapper, another is called reducer.
mapper takes a list of key-value pairs, and transform them to another type of list of key-value pairs. mapreduce framework will typically do some shuffling on the output from mapper. reducer will be takes the shuffled result and generate another list of key-value pairs that is some kind of aggregation.
to be simple, mapper focuses on transformation, and reducer focuses on aggregation.

What's the benefit of distributed computing?

Parallel computing; divide and conquer; moving process instead of moving data that is important in big data processing since moving data to process is too expensive.

What is database sharding?

it is splitting data into smaller manageable pieces and support each pieces on cheap commodity hardware. it's a shared nothing architecture. usually is natively supported in NoSQL databases.

It belongs to horizontal partitioning.

SQL Server has database federation. Oracle have no native support so far. partitioning probably is the best bet in Oracle.

Or from web,
Sharding is a type of database partitioning that separates very large databases the into smaller, faster, more easily managed parts called data shards. The word shard means a small part of a whole.

What is partitioning?

From Oracle:
Partitioning allows a table, index, or index-organized table to be subdivided into smaller pieces, where each piece of such a database object is called a partition. Each partition has its own name, and may optionally have its own storage characteristics.

From the perspective of a database administrator, a partitioned object has multiple pieces that can be managed either collectively or individually. This gives the administrator considerable flexibility in managing partitioned objects. However, from the perspective of the application, a partitioned table is identical to a non-partitioned table;

It's horizontally dividing data into smaller pieces within one database.
usually automatically scheduled jobs need to run to dynamically apply the partitioning function as time goes.

What is difference between arraylist and linked list?

One is a list implementation based on array, another is a list implementation based on linked nodes. array boosts instant random access, but bad at insert and update because rest of list would have to do memory shifting. it's also waste memory when the newly extended array list contains only few elements in the newly allocated memory. Linked list occupies more memory because of the more complex underlying data structure, slower on access element because it has to traversal the list in order to find targeted element. but it is good at insert and delete operation. so use them for different situation. If your data is relatively static and random access is important, then use arraylist. If data is dynamic and update operation is overwhelming, then use linkedlist.

What is static variable?

It's also called class variable. no matter how many instances you created from class, there is only one variable that is shared within the instances.

What is Set operation?

operation between sets. for example, union, union all, intersect, minus/except
Oracle also provides similar functions for collections.
multiset except|intersect|union

What is correlated query?

it's a subquery that use value from other query. for each of rows in other query, correlated query will be evaluated once for each row in other query. performance might be issue if other query's data set is big.

How to optimize it?

converting it to some form of subqueries that is not correlated, such as a view, a subquery, a CTE(subquery factory) and then then use join instead of being part of where condition, it might calculate more data, but it runs only once.

Tell me about your experience on Hadoop

I set up 3 nodes hadoop clusters to demonstrate Hadoop ecosystems (mapReduce, pig, hive,zookeeper, hbase etc) to Langara's IT department.

I also took part in IBM's bigData hacking session to learn its bluemix cloud solution and BigInsight platform based on Hadoop.

What is inner join and outer join?

What will happen if outer join has duplication in one table? See two rows in left table, one row with same key in right table, what will be result in left join, right join and full join?

Inner join returns rows matching on cafeterias in both tables.
outer join returns also returns rows having no matching in the other table.

all three situations returns same result due to there is a matching on both side, duplication will be simply reflected in the result. all of them will return 2 rows. (I gave wrong answer on right join)

with
q1 as
(
select '1' as k1,'first row from q1' as des from dual
union all
select '1' as k1,'second row from q1' as des from dual
),
q2 as
(
select '1' as k1,'first row from q2' as des from dual
)
select * from q1 full outer join q2 on q1.k1=q2.k1;
--select * from q1 left outer join q2 on q1.k1=q2.k1;
--select * from q1 right outer join q2 on q1.k1=q2.k1;

If there's no matching keys, the left join returns 2, right join returns 1 and full joins return 3 row.
with
q1 as
(
select '1' as k1,'first row from q1' as des from dual
union all
select '1' as k1,'second row from q1' as des from dual
),
q2 as
(
select '2' as k1,'first row from q2' as des from dual
)
select * from q1 full outer join q2 on q1.k1=q2.k1;
--select * from q1 left outer join q2 on q1.k1=q2.k1;
--select * from q1 right outer join q2 on q1.k1=q2.k1;

What's the benefit of distributed processing?

My mind set was not switched to right channel, but afterward, I thought he should be expecting to here these:
--divide and conquer
--distributed work load
--parallel processing
--moving process instead of moving data
--cheap computer to delivery super calculation power
--efficient use of computing powser
--scaling out easily and flexible than centralized computing model

What do you keep in mind while designing a servier?

This can be served by SOA service-orientation principles:

standardized service contract
loose coupling
abstraction
reusability
autonomy
statelessness
discoverability
composability

What to do to delivery Availability?

Availability basically means clustered services so that in order to be clustered together
service has to be designed and implemented with these in mind:
autonomy
statelessness

What is scalability?

Sociability is basically a way to fulfill changed demand. it's better described as elasticity. meaning
it scales up or down. in order to scale the service, service should be able to cluster together. its designed should put these in mind:
autonomy
statelessness

In OR Mapping, how to realize parent children relationship?

Thursday, May 21, 2015

Second Experience With GitHub

Well, my first experience is downloading a repository from GitHub with pure git command.

The second one, I was trying to use tools provided by GitHub to interact with GitHub.

1. Created a repository called MonitorService on GitHub under my account.
2. With Git Shell, in the working directory, I executed git init to make a local repository
3. In GitHub, I added the local repository and configured the remote repository's URL. then it failed while I tried to publish to remote after putting changes into staging area.
4. web suggested to use git push, it failed but prompted to pull the first since there's already something there while the repository was generated, a readme file in my case.
5. so these resolved the problem

git pull origin master
git push --set-upstream origin master

now back to GitHub, status became sync instead of publish.

Glad to have this failed experience. I am pretty naive on git still.

But if you are interested, the content is a piece of work I spent 13 hours on to fulfill a technical assignment in a job interview. It is a monitoring service that monitors if a TCP/IP based service is up or down. It's a small work, but it has its own network protocol and it provides basic functions for you to
--define a service
--subscribe a service
--set polling period for individual subscriber, meaning each subscriber can have different polling frequency
--schedule outage period
--list registered service
--quit connection
--shutdown the monitor service from client

For learning purpose, it demonstrates observer design pattern, TCP/IP programming and unit testing. So it is a nice example to check how a server looks like.

The GitHub repository is located at

https://github.com/shijiema/MonitorService