# Category: Data Science

## Second Attempt with Keras

## First Attempt with Keras

## Tree Based Models

Regression tree

XGBoost handles only numeric vectors, and so do decision trees in scikit-learn.

What to do when you have categorical data?

Conversion from categorical to numeric variables
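The most common conversion is one-hot encoding. A minimal sketch with pandas (the `color` column and its values are made-up examples):

```python
import pandas as pd

# One-hot encode a categorical column (illustrative data)
df = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})
dummies = pd.get_dummies(df['color'], prefix='color')
# one column per category, exactly one "hot" entry per row
```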

Adaboost (Adaptive boost) http://machinelearningmastery.com/boosting-and-adaboost-for-machine-learning/
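The core of AdaBoost is reweighting: misclassified samples get larger weights before the next weak learner is trained. A minimal numpy sketch of one boosting round (labels and predictions are made up):

```python
import numpy as np

y = np.array([1, 1, -1, -1])           # true labels (illustrative)
pred = np.array([1, -1, -1, -1])       # weak learner's predictions, one mistake
w = np.ones(4) / 4                     # initial uniform sample weights

err = np.sum(w * (pred != y))          # weighted error rate
alpha = 0.5 * np.log((1 - err) / err)  # weight of this weak learner
w = w * np.exp(-alpha * y * pred)      # up-weight the misclassified sample
w = w / w.sum()                        # renormalize to a distribution
```

The misclassified sample now carries half the total weight, so the next weak learner focuses on it.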

Parameters for XGBoost:

Complete Guide to Parameter Tuning in XGBoost (with codes in Python)

`reg:linear` is simply the squared loss function.

`reg:logistic` uses the logistic regression loss function; see:

https://stats.stackexchange.com/questions/229645/why-there-are-two-different-logistic-loss-formulation-notations/231994#231994
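The difference between the two objectives on a single prediction can be sketched numerically (the label and predicted probability below are illustrative):

```python
import numpy as np

y, p = 1.0, 0.8  # true label and predicted probability (made-up values)

square_loss = (y - p) ** 2                                  # reg:linear
logistic_loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))  # reg:logistic
```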

`objective` function vs `eval_metric`

https://stackoverflow.com/questions/34178287/difference-between-objective-and-feval-in-xgboost
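In plain numpy terms, the distinction is this: an objective must supply first and second derivatives that drive the boosting update, while an eval_metric only scores predictions. A sketch (the function names are mine, not XGBoost's API):

```python
import numpy as np

# objective: returns gradient and hessian of the loss w.r.t. predictions
def squared_error_objective(preds, labels):
    grad = preds - labels           # d/dpred of 0.5*(pred - label)^2
    hess = np.ones_like(preds)      # second derivative is constant
    return grad, hess

# eval_metric: only scores predictions; it never drives the update
def rmse(preds, labels):
    return np.sqrt(np.mean((preds - labels) ** 2))
```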

## Independent Component Analysis

Good resources for explanation

http://cs229.stanford.edu/notes/cs229-notes11.pdf

https://en.wikipedia.org/wiki/Independent_component_analysis

## Use neural network (TensorFlow) for regression 2

Alright, I also want to see what happens if only the features (x) are normalized; the y range in the previous formula is too close to 1, so I change the formula to:

```python
df['y'] = 10*3*df['x0'] + 10*np.sin(10*df['x1'])
```

The calculation is then repeated with and without normalizing y. The results and conclusions are clear: yes, you definitely want to normalize y as well.
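A sketch of the bookkeeping this implies, assuming the network is trained against the normalized y and its predictions are mapped back afterwards:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({'x0': np.random.rand(100), 'x1': np.random.rand(100)})
df['y'] = 10*3*df['x0'] + 10*np.sin(10*df['x1'])

y_std = df['y'].std()
y_norm = df['y'] / y_std    # train the network on this target
y_back = y_norm * y_std     # map (predicted) values back to the original scale
```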

## Use neural network (TensorFlow) for regression 1

After discussion with Mengxi Wu, the poor prediction rate may come from the fact that the inputs are not normalized. Mengxi also points out that the initialization of the weights may need to be related to the number of inputs for each hidden neuron. http://stats.stackexchange.com/questions/47590/what-are-good-initial-weights-in-a-neural-network points out that the weights need to be uniformly initialized in `[-1/sqrt(n), 1/sqrt(n)]`, where `n` is the number of inputs to a given neuron.
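A numpy sketch of that initialization rule (the layer sizes below are illustrative):

```python
import numpy as np

n_in, hidden = 2, 60                 # illustrative layer sizes
limit = 1.0 / np.sqrt(n_in)          # 1/sqrt(number of inputs per neuron)
W1 = np.random.uniform(-limit, limit, size=(n_in, hidden))
```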

The first step of the modification is normalizing all inputs so that their standard deviation equals 1.

```python
std = df.std()
df = df / std
```

The training accuracy is consistently above 96%, clearly better than the previous results without normalization.

This comparison is repeated for hidden = [20, 30, 40, 50, 60, 80, 100]; the prediction rate converges better with larger numbers of hidden neurons. The same plot is also made without normalization.

Later on, we added weight initialization using the Xavier method. The results are also attached.

```python
W1 = tf.get_variable("W1", shape=[FN, hidden],
                     initializer=tf.contrib.layers.xavier_initializer())
W2 = tf.get_variable("W2", shape=[hidden, 1],
                     initializer=tf.contrib.layers.xavier_initializer())
```

## Use neural network (TensorFlow) for regression 0

This post is not a tutorial, but rather a logbook of what we attempted.

The learning logbook starts with using a neural network to do regression.

The data is manually generated using a very simple formula, `y = 3*x0 + sin(10*x1)`; initially, we do not add any noise term.

```python
features = {}
FN = 2
for i in range(FN):
    features['x' + str(i)] = np.random.rand(2000)
df = pd.DataFrame(features)
df['y'] = 3*df['x0'] + np.sin(10*df['x1'])
```

The accuracy function is below:

```python
def accuracy(y, y0):
    # 1 minus the relative L2 error of the prediction
    return 1 - np.sqrt(np.sum(np.square(y0 - y))) / np.sqrt(np.sum(np.square(y0)))
```

In the first attempt, a neural network with a single hidden layer is applied.

```python
optimizer, prediction = None, None

def train(hidden, training_rate=0.1, decay_rate=0.96):
    optimizer, prediction = None, None
    graph = tf.Graph()
    with graph.as_default():
        x = tf.placeholder(tf.float32, [None, FN])
        y = tf.placeholder(tf.float32, [None, 1])
        W1 = tf.Variable(tf.truncated_normal([FN, hidden], stddev=0.001))
        b1 = tf.Variable(tf.zeros([hidden]))
        W2 = tf.Variable(tf.truncated_normal([hidden, 1], stddev=0.001))
        b2 = tf.Variable(tf.zeros([1]))
        inter1 = tf.nn.relu(tf.matmul(x, W1) + b1)
        logits = tf.matmul(inter1, W2) + b2
        regularizers = tf.nn.l2_loss(W1) + tf.nn.l2_loss(W2)
        loss = tf.reduce_mean(tf.square(logits - y)) + 0 * regularizers
        global_step = tf.Variable(0, trainable=False)
        learning_rate = tf.train.exponential_decay(training_rate, global_step,
                                                   1000, decay_rate, staircase=True)
        optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(
            loss, global_step=global_step)
        prediction = logits

    ac = []
    st = []
    r_previous = None
    with tf.Session(graph=graph) as session:
        tf.initialize_all_variables().run()
        for j in range(25000):
            # train on the first 1000 rows
            feed_dict = {x: df.iloc[:1000, :-1].values,
                         y: df.iloc[:1000, -1].values[:, None]}
            _, l, predict, r1, r2, lr = session.run(
                [optimizer, loss, prediction, W1, W2, learning_rate],
                feed_dict=feed_dict)
            if j % 5000 == 0:
                acc = accuracy(predict.flatten(), df.iloc[:1000, -1].values[:1000])
                r_previous = r1
                ac.append(acc)
                st.append(j)
        # evaluate on the held-out second half
        predict, r1, r2, lr = session.run(
            [prediction, W1, W2, learning_rate],
            feed_dict={x: df.iloc[1000:, :-1].values,
                       y: df.iloc[1000:, -1].values[:, None]})
        return accuracy(predict.flatten(), df.iloc[1000:, -1].values)
```

It is observed that, with 60 hidden neurons and 25,000 training steps, the prediction accuracy fluctuates strongly depending on the initialization values.

## Gaussian Process Kernels

As I point out in http://www.jianping-lai.com/2017/03/10/guassian-process/, the covariance matrix `c` can be decomposed into `c = z @ z.T`, where `z = A @ sqrt(S)` and `A, S, B = svd(c)`.

For the linear kernel `k(a, b) = a*b`,

```python
import numpy as np

def k(a, b):
    return a * b

x = np.arange(0, 1.0, 0.2)
n = x.shape[0]
c = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        c[i, j] = k(x[i], x[j])

A, S, B = np.linalg.svd(c, full_matrices=False)
L = np.linalg.cholesky(c + 1e-15*np.eye(n))  # jitter for numerical stability
z = np.matmul(A, np.sqrt(np.diag(S)))

def Print(m):
    for row in m:
        print(' '.join('{:+.2f}'.format(e) for e in row))

Print(z)
print("########")
Print(L)
```

Both the SVD and the Cholesky decomposition lead to a factor whose only significant column is proportional to `x`,

as we have `c[i, j] = x[i]*x[j]`, i.e. a rank-one covariance. These results lead to a straight line.
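A sketch checking the straight-line claim: a sample drawn from the linear kernel's factor is proportional to `x`, so the ratio y/x is constant wherever x is nonzero.

```python
import numpy as np

np.random.seed(0)
x = np.arange(0, 1.0, 0.2)
c = np.outer(x, x)                      # linear-kernel covariance, rank one
A, S, B = np.linalg.svd(c, full_matrices=False)
z = A @ np.sqrt(np.diag(S))
y = z @ np.random.randn(x.shape[0])     # one sample drawn from the factor
slopes = y[1:] / x[1:]                  # x[0] = 0, so skip it
# all slopes agree: the sample is a straight line through the origin
```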

```python
import numpy as np

def k(a, b):
    return min(a, b)

x = np.arange(0, 1.0, 0.1)
n = x.shape[0]
c = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        c[i, j] = k(x[i], x[j])

A, S, B = np.linalg.svd(c, full_matrices=False)
L = np.linalg.cholesky(c + 1e-15*np.eye(n))  # jitter for numerical stability
z = np.matmul(A, np.sqrt(np.diag(S)))

def diff(m):
    # squared distance between consecutive rows of the factor
    for idx in range(m.shape[0] - 1):
        print(np.sum(np.square(m[idx + 1] - m[idx])))

diff(z)
print("########")
diff(L)
```

This result is essentially saying:

the difference between the y's of two nearby data points has a constant standard deviation of `sqrt(x[i+1] - x[i])` (the subscript indexes the data points sequentially). This gives you a randomized but continuous data structure.
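The claim can be verified directly: since a sample is `y = L z` with unit-variance `z`, the variance of each increment `y[i+1] - y[i]` equals the squared distance between consecutive rows of `L`, which for `k(a, b) = min(a, b)` is `x[i+1] - x[i]` = 0.1 here.

```python
import numpy as np

x = np.arange(0, 1.0, 0.1)
n = x.shape[0]
c = np.minimum.outer(x, x)                    # k(a, b) = min(a, b)
L = np.linalg.cholesky(c + 1e-12 * np.eye(n))
# Var(y[i+1] - y[i]) = ||L[i+1] - L[i]||^2 = x[i+1] - x[i] = 0.1
increment_var = np.sum(np.square(L[1:] - L[:-1]), axis=1)
```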

## Gaussian Process

For a Gaussian process, the SVD should be equivalent to the Cholesky decomposition; when the covariance matrix is positive definite, they should be equal.

```python
import numpy as np

u, s, v = np.linalg.svd(c)
# c == np.matmul(np.matmul(u, np.diag(s)), v)
# np.matmul(u, np.diag(s)) == np.matmul(np.diag(s), v).T
L = np.linalg.cholesky(c)
# most often L ~= np.linalg.cholesky(c + 1e-15*np.eye(c.shape[0])) for numerical stability
# c == np.matmul(L, L.T)
# L ~ np.matmul(u, np.diag(s))
# when c is positive definite, they should be equal
```

Treat `y = L @ z`, with `z ~ N(0, I)`, as the random variable,

where `c` is the covariance matrix and `c = L @ L.T`.