Find Nearest K Points

Efficiently find the nearest K points using a k-d tree.

The basic idea is illustrated here

k-d Tree and Nearest Neighbor Search

though I don’t think the pruned areas are plotted correctly.

This algorithm is used for KNN (k-nearest neighbors).

Similar tree structures, quad-tree and octree, are explained here:

https://www.quora.com/What-is-the-difference-between-kd-tree-and-octree-Which-one-is-advantageous
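For concreteness, here is a minimal sketch of nearest-K search with a k-d tree, using scipy's cKDTree (my choice of library; the linked material does not prescribe one):

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.random((1000, 2))       # 1000 random 2-D points
tree = cKDTree(points)               # build the k-d tree

query = np.array([0.5, 0.5])
dists, idx = tree.query(query, k=5)  # 5 nearest neighbors; pruning keeps this fast
print(points[idx], dists)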


SVM vs LR

SVM and LR typically give similar results.

  1. LR is probabilistic, while SVM is a non-probabilistic binary classifier (though there are ways to get around this, e.g., Platt scaling; see the sketch after this list).
  2. The SVM solution is determined only by the support vectors (points that lie on or within the soft margin, whose size is controlled by C/lambda), while LR is affected by all points. This is only true when no kernel is applied. Some people say SVM is less sensitive to outliers, though I have also seen the opposite claim.
  3. For the same reason, linear SVM (no kernel) needs feature normalization, while LR does not. SVM may also perform worse than LR in high-dimensional spaces, where distance measurements become less meaningful.
  4. When kernel tricks are applied, the SVM solution turns out to be sparser, so SVM has better computational complexity. (This is commonly brought up, but I have not seen an explanation of why.)
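A minimal sketch contrasting the two in scikit-learn (my choice of library): probability=True bolts Platt scaling onto SVM (the workaround from point 1), and the StandardScaler reflects the normalization note in point 3.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# LR yields probabilities directly.
lr = LogisticRegression().fit(X, y)
print(lr.predict_proba(X[:3]))

# SVM is non-probabilistic; probability=True adds Platt scaling on top.
svm = make_pipeline(StandardScaler(), SVC(kernel='linear', probability=True)).fit(X, y)
print(svm.predict_proba(X[:3]))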


Google Cloud Machine Learning

1. Create a Google Compute Engine instance.
2. Enable the Google Cloud Machine Learning API under this project.
3. Create a Google Cloud Storage bucket.
4. Log in to the Google Compute Engine instance and create a folder for the ML project; in my case I called it MLtest. Inside this folder, two basic configuration files are required.

a. config.yaml

It is important to set “runtimeVersion” to the latest version; otherwise, some functions may not be available.

trainingInput:
  scaleTier: BASIC
  runtimeVersion: "1.7"
  pythonVersion: "3.5"

b. setup.py

from setuptools import setup, find_packages

setup(name='example',
  version='0.1',
  packages=find_packages(),
  description='example to run keras on gcloud ml-engine',
  author='Jianping Lai',
  author_email='tbjc1magic@gmail.com',
  license='MIT',
  install_requires=[
      'keras',
      'h5py',
      'xgboost', ### put required packages here
  ],
  zip_safe=False)

5. The files inside the folder are organized as shown below.

.
├── config.yaml
├── input
│   ├── input.csv
│   └── test.pkl
├── MANIFEST
├── output
│   └── test-output.pkl
├── run
├── setup.py
└── train
    ├── 1-multiply.py
    ├── 2-input.py
    ├── 3-output.py
    ├── 4-args.py
    ├── 5-xgboost.py
    ├── 6-keras.py
    ├── 7-output2.py
    ├── 7-output3.py
    ├── 8-output.py
    └── __init__.py

To run a training module LOCALLY, such as 8-output.py, type:

python -m train.8-output

To submit the job to Google Cloud ML, the following command is used:

gcloud ml-engine jobs submit training tbjc_ml32 \
    --package-path train \
    --module-name train.8-output \
    --region us-east1 \
    --config config.yaml \
    --staging-bucket gs://tbjc1magic-deeplearning/ \
    -- \
    --input_dir gs://tbjc1magic-deeplearning/

Everything after the bare -- is passed through to the training module as command-line arguments; here the main function receives ['--input_dir', 'gs://tbjc1magic-deeplearning/'].
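Inside the training module, those trailing arguments can be picked up with argparse. A minimal sketch (this main function is my illustration, not the contents of the actual 8-output.py):

import argparse

def main():
    # everything after the bare -- in the gcloud command lands in sys.argv
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_dir', required=True)  # e.g. gs://tbjc1magic-deeplearning/
    args = parser.parse_args()
    print('reading inputs from', args.input_dir)

if __name__ == '__main__':
    main()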

However, user code cannot read from or write to the Cloud Storage bucket directly with the built-in open(). That is to say, the commands below will NOT work:

with open('gs://tbjc1magic-deeplearning/', 'w') as f:
    f.write(...)

File I/O has to be handled through TensorFlow's file_io module:

import pickle

from tensorflow.python.lib.io import file_io

base = 'gs://tbjc1magic-deeplearning/'

# pickle writes bytes, so the file is opened in binary mode
with file_io.FileIO(base + 'output/test-output.pkl', mode='wb') as f:
    pickle.dump(data, f)

Similarly, a file can be read back the same way.
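For example, a quick sketch of reading the pickle back (assuming the same bucket layout; binary mode because pickle reads bytes):

import pickle

from tensorflow.python.lib.io import file_io

base = 'gs://tbjc1magic-deeplearning/'

with file_io.FileIO(base + 'output/test-output.pkl', mode='rb') as f:
    data = pickle.load(f)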

6. To check the current status of submitted jobs:

gcloud ml-engine jobs list


Anaconda virtual environment setup: Python 2, Python 3, and R

After installing Anaconda 3:

## set up python 2 environment
conda create -n python2 python=2.7
source activate python2
conda install ipykernel
python -m ipykernel install --user --name python2
## set up R environment

conda create -n r anaconda
source activate r
### install R kernel ####
conda install -c r r
### install notebook R kernel ###
conda install -c r r-irkernel
### install R packages ###
conda install -c r r-essentials
conda install -c r r-ggplot2
conda install -c r r-nlme
conda install -c r r-lme4


Tree Based Models

Regression tree

XGBoost handles only numeric vectors, and so do decision trees in sklearn.

What to do when you have categorical data?

Conversion from categorical to numeric variables
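One common option is one-hot encoding. A minimal sketch with pandas get_dummies (my choice; the linked article may cover other encodings):

import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'red'], 'size': [1, 2, 3]})
encoded = pd.get_dummies(df, columns=['color'], dtype=int)  # one column per category
print(encoded)
#    size  color_green  color_red
# 0     1            0          1
# 1     2            1          0
# 2     3            0          1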

AdaBoost (Adaptive Boosting): http://machinelearningmastery.com/boosting-and-adaboost-for-machine-learning/

Parameters for XGBoost:

Complete Guide to Parameter Tuning in XGBoost (with codes in Python)

reg:linear is simply the squared-error loss function.
reg:logistic uses the logistic regression loss function; see
https://stats.stackexchange.com/questions/229645/why-there-are-two-different-logistic-loss-formulation-notations/231994#231994

objective function vs eval_metric

https://stackoverflow.com/questions/34178287/difference-between-objective-and-feval-in-xgboost
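In short, objective is the loss that training optimizes, while eval_metric is only computed and reported on the watch list (and used for early stopping). A minimal sketch with synthetic data (parameter values here are illustrative):

import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X, y = rng.random((200, 5)), rng.integers(0, 2, 200)
dtrain = xgb.DMatrix(X[:150], label=y[:150])
dval = xgb.DMatrix(X[150:], label=y[150:])

params = {
    'objective': 'reg:logistic',  # the loss being optimized
    'eval_metric': 'logloss',     # the metric reported on the eval set
}
model = xgb.train(params, dtrain, num_boost_round=20, evals=[(dval, 'val')])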