data science

5 Vs

  • volume
  • variety
  • value
  • velocity
  • veracity

data processing chain

  • data
  • database
  • data store
  • data mining
  • data value

data warehouse

  • subject oriented
  • integrated
  • non-volatile
  • time-variant

data warehouse schema types

  • star schema
  • snowflake schema
  • fact constellation

data types

| Type | Description | Example |
| --- | --- | --- |
| nominal data | unordered collection of categories | dog, cat, name |
| ordinal data | ordered categories | level (low/medium/high) |
| interval data | equal intervals, no true zero (ratios are not meaningful) | temperature in °C |
| ratio data | true zero, ratios/fractions are meaningful | float measurements |
| BLOB (binary large object) | raw binary data | image, sound file |

Quantile

Q1: smallest 0%–25% of the data
Q2: 25%–50%
Q3: 50%–75%
Q4: 75%–100% (the largest values)
Interquartile range (IQR): Q3 − Q1

confusion matrix

|  | Predicted Positive | Predicted Negative |
| --- | --- | --- |
| Actual Positive (True) | TP | FN |
| Actual Negative (False) | FP | TN |
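
A minimal Python sketch of the counts in this table and the metrics commonly derived from them (labels assumed to be 1 = positive, 0 = negative):

```python
# Minimal sketch: confusion-matrix counts and derived metrics.
def confusion_counts(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fn, fp, tn

tp, fn, fp, tn = confusion_counts([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
accuracy  = (tp + tn) / (tp + fn + fp + tn)
precision = tp / (tp + fp)   # of the predicted positives, how many are real
recall    = tp / (tp + fn)   # of the actual positives, how many were found
f1        = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)
```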

ROC curve

correlations

linear regression

similarity functions

Example ratings (blank = missing):

|  | h1 | h2 | h3 | h4 |
| --- | --- | --- | --- | --- |
| a | 1 |  |  | 0.5 |
| b | 0.5 |  | 0.3 |  |
| c |  | 0.7 |  | 0.4 |

jaccard similarity

cosine similarity

Pearson correlation coefficient
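
A minimal Python sketch of the three measures, applied to rows a and c of the table above; treating the missing entries as 0 is an illustrative assumption:

```python
import math

# Rows a and c from the table above; missing ratings treated as 0 (illustrative).
a = [1.0, 0.0, 0.0, 0.5]
c = [0.0, 0.7, 0.0, 0.4]

def jaccard(x, y):
    # On the sets of rated items (non-zero positions): |X ∩ Y| / |X ∪ Y|
    xs, ys = {i for i, v in enumerate(x) if v}, {i for i, v in enumerate(y) if v}
    return len(xs & ys) / len(xs | ys)

def cosine(x, y):
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return dot / (math.sqrt(sum(xi * xi for xi in x)) * math.sqrt(sum(yi * yi for yi in y)))

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)

print(jaccard(a, c), cosine(a, c), pearson(a, c))
```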

Feature Selection Methods

  • Filter methods
  • Wrapper methods
  • Embedded methods

Classification

Decision Tree

entropy

Information Gain(IG)

If splitting the data reduces the entropy (i.e., the information gain is high), the split point is a good one.

Gini index (impurity)
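
A minimal Python sketch of these three split measures (plain lists of class labels assumed):

```python
import math
from collections import Counter

def entropy(labels):
    # H(D) = -sum(p_i * log2(p_i)) over the class proportions p_i
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    # Gini impurity = 1 - sum(p_i^2)
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent_labels, child_label_groups):
    # IG = H(parent) - weighted average H(children); a larger drop = a better split
    n = len(parent_labels)
    weighted = sum(len(g) / n * entropy(g) for g in child_label_groups)
    return entropy(parent_labels) - weighted

parent = ["yes", "yes", "no", "no", "yes", "no"]
split  = [["yes", "yes", "yes"], ["no", "no", "no"]]   # a perfect split
print(entropy(parent), gini(parent), information_gain(parent, split))
```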

random forest (bootstrap aggregation of decision trees)

Uses multiple randomized decision trees and combines their votes to produce the final answer.
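
A minimal usage sketch, assuming scikit-learn and the iris dataset purely for illustration; each tree is trained on a bootstrap sample with a random feature subset, and the forest takes the majority vote:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 randomized trees; the forest's prediction is the majority vote of the trees.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))
```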

naive bayes

zero-probability problem (when a conditional probability estimate is zero)

t: number of possible values (conditions) of the attribute
n: number of training cases
c: number of cases that have the attribute value in question (n_c)
m: number of virtual samples
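
One common fix is the Laplace (add-one) correction, or more generally the m-estimate, which mixes m virtual samples into the counts so that no conditional probability becomes exactly zero. A minimal Python sketch, assuming the notation above, with p as the prior probability used by the m-estimate:

```python
def laplace_estimate(n_c, n, t):
    # Add-one smoothing: pretend every possible value was seen once more.
    return (n_c + 1) / (n + t)

def m_estimate(n_c, n, m, p):
    # m-estimate: blend the observed frequency with a prior p using m virtual samples.
    return (n_c + m * p) / (n + m)

# Even if a value was never observed (n_c = 0), the probability stays above zero:
print(laplace_estimate(0, 10, 3))   # 1/13 instead of 0
print(m_estimate(0, 10, 3, 1 / 3))  # 1/13 instead of 0
```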

Bayesian Network

Bayesian networks are directed and acyclic, whereas Markov networks are undirected and can be cyclic.

SVM(Support Vector Machines)

linearly separable SVM

linearly inseparable SVM

map the data to a higher-dimensional space (kernel trick) so that it becomes separable
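
A minimal usage sketch, assuming scikit-learn and the two-moons toy data for illustration: a linear kernel for the separable case, and an RBF kernel to implicitly map the inseparable case to a higher dimension:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaving half-moons are not linearly separable in 2-D.
X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm    = SVC(kernel="rbf").fit(X, y)   # kernel trick: implicit higher-dimensional mapping

print(linear_svm.score(X, y), rbf_svm.score(X, y))  # the RBF kernel fits this data better
```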

KNN (K Nearest Neighbors)

  • The KNN algorithm works as follows:
    1. The K in KNN means finding the K data points closest to the query point and classifying it based on them.
    2. If K is 3, the three nearest data points are selected.
    3. Once these K points are selected, a vote is taken: the query is assigned to the class with the most points among the K.
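
A minimal plain-Python sketch of these steps (Euclidean distance and majority vote assumed):

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    # 1. Compute the distance from the query to every training point.
    dists = sorted(
        (math.dist(p, query), label) for p, label in zip(train_points, train_labels)
    )
    # 2. Take the K nearest neighbours, and 3. let them vote.
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(points, labels, (2, 2)))   # -> "A"
```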

Linear Classifiers

Logistic Regression

Gradient Descent

Coordinate Descent

Coordinate descent updates one parameter at a time, while gradient descent attempts to update all parameters at once.
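
A minimal numpy sketch of the difference on a least-squares objective (the data, learning rate, and step counts are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w

def gradient_descent(X, y, lr=0.05, steps=200):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)   # gradient of 0.5 * MSE
        w -= lr * grad                       # update every parameter at once
    return w

def coordinate_descent(X, y, sweeps=50):
    w = np.zeros(X.shape[1])
    for _ in range(sweeps):
        for j in range(X.shape[1]):          # update one coordinate at a time
            residual = y - X @ w + X[:, j] * w[j]
            w[j] = X[:, j] @ residual / (X[:, j] @ X[:, j])
    return w

print(gradient_descent(X, y), coordinate_descent(X, y))
```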

Improve Classification Accuracy

Bagging: bootstrap aggregating

Given a training set D of size n, bagging uniformly samples, with replacement (bootstrap sampling), m subsets of size n′ from D to use as new training sets. Running a classification or regression algorithm on each of these m training sets yields m models; combining them by averaging (regression) or majority vote (classification) gives the bagging result.
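
A minimal sketch of this procedure, assuming scikit-learn decision trees as the base learners and the iris data purely for illustration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
m = 25                                             # number of bootstrap models

models = []
for _ in range(m):
    idx = rng.integers(0, len(X), size=len(X))     # sample n points with replacement
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Combine by majority vote (for regression you would average instead).
all_preds = np.array([mdl.predict(X) for mdl in models])            # shape (m, n)
vote = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, all_preds)
print((vote == y).mean())
```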

Boosting

  1. A series of k classifiers is learned iteratively.
  2. After a classifier Mi is learned, the subsequent classifier, Mi+1, is set to pay more attention to the training tuples that were misclassified by Mi.
  3. The final M* combines the votes of each individual classifier, where the weight of each classifier’s vote is a function of its accuracy.
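
A minimal usage sketch, assuming scikit-learn's AdaBoostClassifier as one concrete instance of this scheme (the dataset and parameters are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 50 weak learners trained in sequence; each round up-weights the tuples the
# previous classifier got wrong, and the final vote is accuracy-weighted.
boost = AdaBoostClassifier(n_estimators=50, random_state=0)
boost.fit(X_train, y_train)
print(boost.score(X_test, y_test))
```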

Imbalanced Data Sets

  • Oversampling: oversample the minority class.
  • Under-sampling: randomly eliminate tuples from the majority class.
  • Synthesizing: synthesize new minority-class tuples.
  • Threshold-moving (see the sketch after this list):
    • Move the decision threshold, t, so that rare-class tuples are easier to classify, and hence there is less chance of costly false-negative errors.
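
A minimal sketch of threshold-moving, assuming scikit-learn and a synthetic imbalanced dataset; lowering the cut-off below the default 0.5 makes the rare class easier to predict as positive:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Imbalanced data: only ~5% of tuples belong to the positive (rare) class.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]

default_pred = (proba >= 0.5).astype(int)   # standard threshold
moved_pred   = (proba >= 0.2).astype(int)   # moved threshold: fewer costly false negatives
print(default_pred.sum(), moved_pred.sum()) # the moved threshold flags more rare-class tuples
```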

MLP (ANN)

  • Artificial Neural Unit
  • Multi-Layer Perceptron (MLP)
  • Activation function
  • Back Propagation
  • Learning Rate
  • Loss Function (MSE mean squared error; Cross Entropy (CE) with the softmax function)
  • Weight Initialization (Xavier initialization, MSRA initialization)
  • Regularization (Dropout)
  • training set
    • 60% of the data
  • validation set
    • 20% of the data
  • Testing set
    • 20% of the data

k-fold cross validation

In the K-fold method, the data is split into K equal parts, where K is chosen by us. For example, suppose K = 10, i.e., the training set is split into ten folds. The same model is then trained ten times; each run picks nine of the ten folds as training data, and the remaining fold, which did not participate in training, serves as the validation set.
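
A minimal sketch of the K = 10 case described above, assuming scikit-learn and the iris data purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
kfold = KFold(n_splits=10, shuffle=True, random_state=0)

scores = []
for train_idx, val_idx in kfold.split(X):
    # Nine folds train the model; the held-out fold validates it.
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[val_idx], y[val_idx]))

print(sum(scores) / len(scores))   # average validation accuracy over the 10 folds
```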

Activation Function

| Activation Function | Formula | Range |
| --- | --- | --- |
| Tanh | (e^x − e^−x) / (e^x + e^−x) | (−1, 1) |
| Sigmoid (Logistic) | 1 / (1 + e^−x) | (0, 1) |
| ReLU | max(0, x) | [0, ∞) |
| SoftPlus | ln(1 + e^x) | (0, ∞) |
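
A minimal numpy sketch of the four functions in the table:

```python
import numpy as np

def tanh(x):      return np.tanh(x)                  # range (-1, 1)
def sigmoid(x):   return 1.0 / (1.0 + np.exp(-x))    # range (0, 1)
def relu(x):      return np.maximum(0.0, x)          # range [0, inf)
def softplus(x):  return np.log1p(np.exp(x))         # range (0, inf), a smooth ReLU

x = np.array([-2.0, 0.0, 2.0])
for f in (tanh, sigmoid, relu, softplus):
    print(f.__name__, f(x))
```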

scale features (normalize)

| Scaling Method | Formula | Range |
| --- | --- | --- |
| Z-Score Normalization | x′ = (x − μ) / σ | unbounded (mean 0, std 1) |
| Min-Max Scaling | x′ = (x − min) / (max − min) | [0, 1] |
| Maximum Absolute Scaling | x′ = x / max(\|x\|) | [−1, 1] |
| Robust Scaling | x′ = (x − median) / IQR | unbounded (robust to outliers) |
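
A minimal numpy sketch of the four scaling methods, using a toy feature with an outlier:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # one feature with an outlier

z_score = (x - x.mean()) / x.std()                     # mean 0, std 1
min_max = (x - x.min()) / (x.max() - x.min())          # range [0, 1]
max_abs = x / np.abs(x).max()                          # range [-1, 1]
iqr     = np.percentile(x, 75) - np.percentile(x, 25)
robust  = (x - np.median(x)) / iqr                     # median 0, scaled by IQR

print(z_score, min_max, max_abs, robust, sep="\n")
```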

Clustering Algorithms

  • Partition algorithms
    • K means clustering
    • Gaussian Mixture Model
  • Hierarchical algorithms
    • Bottom-up, agglomerative
    • Top-down, divisive

K means

Place K points (centroids), assign every data point to its nearest centroid, then move each centroid to a position that fits its assigned data (the mean of those points); repeat until the centroids stop moving.
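
A minimal numpy sketch of that loop (random initial centroids and a fixed iteration count are assumptions):

```python
import numpy as np

def kmeans(X, k=2, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # pick K starting points
    for _ in range(iters):
        # Assign every point to its nearest centroid ...
        labels = np.argmin(np.linalg.norm(X[:, None] - centroids[None], axis=2), axis=1)
        # ... then move each centroid to the mean of its assigned points.
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    labels = np.argmin(np.linalg.norm(X[:, None] - centroids[None], axis=2), axis=1)
    return centroids, labels

X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]], dtype=float)
print(kmeans(X, k=2))
```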

K-Medoids

Group points around center points: if a point is close to a group's center it is merged into the group, and if it is far it is removed from the group. Unlike K-means, the center (medoid) is always an actual data point rather than a computed mean.

Hierarchical Clustering (Bottom-up)

Start with every point as its own cluster and repeatedly merge the closest clusters, based on the distance between clusters.

distance of clusters

  • single-link (closest points of the two groups)
  • complete-link (farthest points of the two groups)
  • centroid (distance between the groups' center points)
  • average-link (average distance between all pairs of elements across the two groups)

weighted vs. unweighted

Either weight each group by its number of points, or simply average the two groups' distances.

DBSCAN (Bottom-up)

  1. Start with every point as a group by itself.
  2. If groups a and b fall within range of a “high-density area”, merge them into the same group.
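
A minimal usage sketch, assuming scikit-learn's DBSCAN; eps and min_samples together define the “high-density area”:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus one far-away outlier.
X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8], [50, 50]], dtype=float)

# eps = neighbourhood radius, min_samples = points needed for a high-density area.
db = DBSCAN(eps=2.0, min_samples=3).fit(X)
print(db.labels_)   # cluster ids; -1 marks noise points (the outlier)
```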

pattern mining

FP tree

rules (check whether the itemsets involved are frequent itemsets)

If lift(A, B) > 1, then A and B are positively correlated and the rule is effective.
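
A minimal Python sketch of the support, confidence, and lift calculation behind that rule, on a made-up transaction list:

```python
transactions = [
    {"milk", "bread"}, {"milk", "bread"}, {"milk", "bread"},
    {"milk"}, {"bread"}, {"butter"},
]
n = len(transactions)

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / n

sup_a, sup_b = support({"milk"}), support({"bread"})
sup_ab = support({"milk", "bread"})

confidence = sup_ab / sup_a          # P(B | A)
lift = sup_ab / (sup_a * sup_b)      # > 1 -> A and B are positively correlated
print(sup_ab, confidence, lift)
```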

pattern type

  • Rare Patterns
  • Negative Patterns(Unlikely to happen together)
    • Kulczynski(A, B) less than a threshold
  • Multilevel Associations
  • Multidimensional Associations

constraint types

pruning Strategies

Use different constraint conditions to prune itemsets early.

  • Monotonic:
    • If c is satisfied, no need to check c again
  • Anti-monotonic:
    • If constraint c is violated, its further mining can be terminated
  • Convertible:
    • c can be converted to monotonic or anti-monotonic if items can be properly ordered in processing
  • Succinct:
    • If the constraint c can be enforced by directly manipulating the data

Sequential Patterns

Items within an element are unordered and are listed alphabetically; for example, <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>.
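
A minimal Python sketch of the subsequence check used in that example, representing each sequence as a list of itemsets:

```python
def is_subsequence(sub, seq):
    # Each element is an itemset; sub is contained in seq if its elements can be
    # matched, in order, to elements of seq that are supersets of them.
    i = 0
    for element in seq:
        if i < len(sub) and sub[i] <= element:
            i += 1
    return i == len(sub)

sub = [{"a"}, {"b", "c"}, {"d"}, {"c"}]                        # <a(bc)dc>
seq = [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}]  # <a(abc)(ac)d(cf)>
print(is_subsequence(sub, seq))   # True
```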

Representative algorithms

  • GSP (Generalized Sequential Patterns):
    • Generate “length-(k+1)” candidate sequences from “length-k” frequent sequences using Apriori
  • Vertical format-based mining: SPADE (Zaki@Machine Learning’00)
  • Pattern-growth methods: PrefixSpan

dimensionality reduction

PCA (Principal Component Analysis)

PCoA (Principal Coordinate Analysis)

SVD (Singular Value Decomposition)