data science

5 Vs

  • volume
  • variety
  • value
  • velocity
  • veracity

data processing chain

  • data
  • database
  • data store
  • data mining
  • data value

data warehouse

  • subject oriented
  • integrated
  • non-volatile
  • time-variant

data warehouse schema types

  • star schema
  • snowflake schema
  • fact constellation

data types

| Type | Description | Example |
| --- | --- | --- |
| nominal data | unordered collection of categories | dog, cat, name |
| ordinal data | ordered categories | level (low/medium/high) |
| interval data | equal intervals, no true zero (ratios are not meaningful) | temperature in °C |
| ratio data | true zero, ratios/fractions are meaningful | float measurements |
| BLOB (binary large object) | raw binary data | image, sound file |

Quantile

Q1: smallest 0%–25% of the data
Q2: 25%–50%
Q3: 50%–75%
Q4: 75%–100% (the largest values)
Interquartile range (IQR): Q3 − Q1

confusion matrix

|  | Predicted Positive | Predicted Negative |
| --- | --- | --- |
| Actual Positive (True) | TP | FN |
| Actual Negative (False) | FP | TN |
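
A minimal Python sketch of the counts in this table and the metrics commonly derived from them (labels assumed to be 1 = positive, 0 = negative):

```python
# Minimal sketch: confusion-matrix counts and derived metrics.
def confusion_counts(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fn, fp, tn

tp, fn, fp, tn = confusion_counts([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
accuracy  = (tp + tn) / (tp + fn + fp + tn)
precision = tp / (tp + fp)   # of the predicted positives, how many are real
recall    = tp / (tp + fn)   # of the actual positives, how many were found
f1        = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)
```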

ROC curve

correlations

linear regression

similarity functions

Example ratings (blank = missing):

|  | h1 | h2 | h3 | h4 |
| --- | --- | --- | --- | --- |
| a | 1 |  |  | 0.5 |
| b | 0.5 |  | 0.3 |  |
| c |  | 0.7 |  | 0.4 |

jaccard similarity

cosine similarity

Pearson correlation coefficient
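
A minimal Python sketch of the three measures, applied to rows a and c of the table above; treating the missing entries as 0 is an illustrative assumption:

```python
import math

# Rows a and c from the table above; missing ratings treated as 0 (illustrative).
a = [1.0, 0.0, 0.0, 0.5]
c = [0.0, 0.7, 0.0, 0.4]

def jaccard(x, y):
    # On the sets of rated items (non-zero positions): |X ∩ Y| / |X ∪ Y|
    xs, ys = {i for i, v in enumerate(x) if v}, {i for i, v in enumerate(y) if v}
    return len(xs & ys) / len(xs | ys)

def cosine(x, y):
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return dot / (math.sqrt(sum(xi * xi for xi in x)) * math.sqrt(sum(yi * yi for yi in y)))

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)

print(jaccard(a, c), cosine(a, c), pearson(a, c))
```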

Feature Selection Methods

  • Filter methods
  • Wrapper methods
  • Embedded methods

Classification

Decision Tree

entropy

Information Gain(IG)

If splitting the data reduces the entropy (i.e., the information gain is high), the split point is a good one.

Gini index (impurity)
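
A minimal Python sketch of these three split measures (plain lists of class labels assumed):

```python
import math
from collections import Counter

def entropy(labels):
    # H(D) = -sum(p_i * log2(p_i)) over the class proportions p_i
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    # Gini impurity = 1 - sum(p_i^2)
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent_labels, child_label_groups):
    # IG = H(parent) - weighted average H(children); a larger drop = a better split
    n = len(parent_labels)
    weighted = sum(len(g) / n * entropy(g) for g in child_label_groups)
    return entropy(parent_labels) - weighted

parent = ["yes", "yes", "no", "no", "yes", "no"]
split  = [["yes", "yes", "yes"], ["no", "no", "no"]]   # a perfect split
print(entropy(parent), gini(parent), information_gain(parent, split))
```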

random forest (bootstrap aggregation of decision trees)

Uses multiple randomized decision trees and combines their votes to produce the final answer.
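
A minimal usage sketch, assuming scikit-learn and the iris dataset purely for illustration; each tree is trained on a bootstrap sample with a random feature subset, and the forest takes the majority vote:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 randomized trees; the forest's prediction is the majority vote of the trees.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))
```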

naive bayes

zero-probability problem (when a conditional probability estimate is zero)

t: number of possible values (conditions) of the attribute
n: number of training cases
c: number of cases that have the attribute value in question (n_c)
m: number of virtual samples
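
One common fix is the Laplace (add-one) correction, or more generally the m-estimate, which mixes m virtual samples into the counts so that no conditional probability becomes exactly zero. A minimal Python sketch, assuming the notation above, with p as the prior probability used by the m-estimate:

```python
def laplace_estimate(n_c, n, t):
    # Add-one smoothing: pretend every possible value was seen once more.
    return (n_c + 1) / (n + t)

def m_estimate(n_c, n, m, p):
    # m-estimate: blend the observed frequency with a prior p using m virtual samples.
    return (n_c + m * p) / (n + m)

# Even if a value was never observed (n_c = 0), the probability stays above zero:
print(laplace_estimate(0, 10, 3))   # 1/13 instead of 0
print(m_estimate(0, 10, 3, 1 / 3))  # 1/13 instead of 0
```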

Bayesian Network

Bayesian networks are directed and acyclic, whereas Markov networks are undirected and can be cyclic.

SVM(Support Vector Machines)

linearly separable SVM

linearly inseparable SVM

map the data to a higher-dimensional space (kernel trick) so that it becomes separable
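
A minimal usage sketch, assuming scikit-learn and the two-moons toy data for illustration: a linear kernel for the separable case, and an RBF kernel to implicitly map the inseparable case to a higher dimension:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaving half-moons are not linearly separable in 2-D.
X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm    = SVC(kernel="rbf").fit(X, y)   # kernel trick: implicit higher-dimensional mapping

print(linear_svm.score(X, y), rbf_svm.score(X, y))  # the RBF kernel fits this data better
```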

KNN (K Nearest Neighbors)

  • The KNN algorithm works as follows:
    1. The K in KNN means finding the K data points closest to the query point and classifying it based on them.
    2. If K is 3, the three nearest data points are selected.
    3. Once these K points are selected, a vote is taken: the query is assigned to the class with the most points among the K.
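
A minimal plain-Python sketch of these steps (Euclidean distance and majority vote assumed):

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    # 1. Compute the distance from the query to every training point.
    dists = sorted(
        (math.dist(p, query), label) for p, label in zip(train_points, train_labels)
    )
    # 2. Take the K nearest neighbours, and 3. let them vote.
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(points, labels, (2, 2)))   # -> "A"
```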

Linear Classifiers

Logistic Regression

Gradient Descent

Coordinate Descent

Coordinate descent updates one parameter at a time, while gradient descent attempts to update all parameters at once.
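
A minimal numpy sketch of the difference on a least-squares objective (the data, learning rate, and step counts are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w

def gradient_descent(X, y, lr=0.05, steps=200):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)   # gradient of 0.5 * MSE
        w -= lr * grad                       # update every parameter at once
    return w

def coordinate_descent(X, y, sweeps=50):
    w = np.zeros(X.shape[1])
    for _ in range(sweeps):
        for j in range(X.shape[1]):          # update one coordinate at a time
            residual = y - X @ w + X[:, j] * w[j]
            w[j] = X[:, j] @ residual / (X[:, j] @ X[:, j])
    return w

print(gradient_descent(X, y), coordinate_descent(X, y))
```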

Improve Classification Accuracy

Bagging: bootstrap aggregating

Given a training set D of size n, bagging uniformly samples, with replacement (bootstrap sampling), m subsets of size n′ from D to use as new training sets. Running a classification or regression algorithm on each of these m training sets yields m models; combining them by averaging (regression) or majority vote (classification) gives the bagging result.
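
A minimal sketch of this procedure, assuming scikit-learn decision trees as the base learners and the iris data purely for illustration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
m = 25                                             # number of bootstrap models

models = []
for _ in range(m):
    idx = rng.integers(0, len(X), size=len(X))     # sample n points with replacement
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Combine by majority vote (for regression you would average instead).
all_preds = np.array([mdl.predict(X) for mdl in models])            # shape (m, n)
vote = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, all_preds)
print((vote == y).mean())
```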

Boosting

  1. A series of k classifiers is learned iteratively.
  2. After a classifier Mi is learned, the subsequent classifier, Mi+1, is set to pay more attention to the training tuples that were misclassified by Mi.
  3. The final M* combines the votes of each individual classifier, where the weight of each classifier’s vote is a function of its accuracy.
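
A minimal usage sketch, assuming scikit-learn's AdaBoostClassifier as one concrete instance of this scheme (the dataset and parameters are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 50 weak learners trained in sequence; each round up-weights the tuples the
# previous classifier got wrong, and the final vote is accuracy-weighted.
boost = AdaBoostClassifier(n_estimators=50, random_state=0)
boost.fit(X_train, y_train)
print(boost.score(X_test, y_test))
```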

Imbalanced Data Sets

  • Oversampling: oversample the minority class.
  • Under-sampling: randomly eliminate tuples from the majority class.
  • Synthesizing: synthesize new minority-class tuples.
  • Threshold-moving (see the sketch after this list):
    • Move the decision threshold, t, so that rare-class tuples are easier to classify, and hence there is less chance of costly false-negative errors.
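
A minimal sketch of threshold-moving, assuming scikit-learn and a synthetic imbalanced dataset; lowering the cut-off below the default 0.5 makes the rare class easier to predict as positive:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Imbalanced data: only ~5% of tuples belong to the positive (rare) class.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]

default_pred = (proba >= 0.5).astype(int)   # standard threshold
moved_pred   = (proba >= 0.2).astype(int)   # moved threshold: fewer costly false negatives
print(default_pred.sum(), moved_pred.sum()) # the moved threshold flags more rare-class tuples
```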

MLP (ANN)

  • Artificial Neural Unit
  • Multi-Layer Perceptron (MLP)
  • Activation function
  • Back Propagation
  • Learning Rate
  • Loss Function (MSE mean squared error; Cross Entropy (CE) with the softmax function)
  • Weight Initialization (Xavier initialization, MSRA initialization)
  • Regularization (Dropout)
  • training set
    • 60% of the data
  • validation set
    • 20% of the data
  • Testing set
    • 20% of the data

k-fold cross validation

In the K-fold method, the data is split into K equal parts, where K is chosen by us. For example, suppose K = 10, i.e., the training set is split into ten folds. The same model is then trained ten times; each run picks nine of the ten folds as training data, and the remaining fold, which did not participate in training, serves as the validation set.
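
A minimal sketch of the K = 10 case described above, assuming scikit-learn and the iris data purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
kfold = KFold(n_splits=10, shuffle=True, random_state=0)

scores = []
for train_idx, val_idx in kfold.split(X):
    # Nine folds train the model; the held-out fold validates it.
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[val_idx], y[val_idx]))

print(sum(scores) / len(scores))   # average validation accuracy over the 10 folds
```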

Activation Function

| Activation Function | Formula | Range |
| --- | --- | --- |
| Tanh | (e^x − e^−x) / (e^x + e^−x) | (−1, 1) |
| Sigmoid (Logistic) | 1 / (1 + e^−x) | (0, 1) |
| ReLU | max(0, x) | [0, ∞) |
| SoftPlus | ln(1 + e^x) | (0, ∞) |
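
A minimal numpy sketch of the four functions in the table:

```python
import numpy as np

def tanh(x):      return np.tanh(x)                  # range (-1, 1)
def sigmoid(x):   return 1.0 / (1.0 + np.exp(-x))    # range (0, 1)
def relu(x):      return np.maximum(0.0, x)          # range [0, inf)
def softplus(x):  return np.log1p(np.exp(x))         # range (0, inf), a smooth ReLU

x = np.array([-2.0, 0.0, 2.0])
for f in (tanh, sigmoid, relu, softplus):
    print(f.__name__, f(x))
```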

scale features (normalize)

| Scaling Method | Formula | Range |
| --- | --- | --- |
| Z-Score Normalization | x′ = (x − μ) / σ | unbounded (mean 0, std 1) |
| Min-Max Scaling | x′ = (x − min) / (max − min) | [0, 1] |
| Maximum Absolute Scaling | x′ = x / max(\|x\|) | [−1, 1] |
| Robust Scaling | x′ = (x − median) / IQR | unbounded (robust to outliers) |
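
A minimal numpy sketch of the four scaling methods, using a toy feature with an outlier:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # one feature with an outlier

z_score = (x - x.mean()) / x.std()                     # mean 0, std 1
min_max = (x - x.min()) / (x.max() - x.min())          # range [0, 1]
max_abs = x / np.abs(x).max()                          # range [-1, 1]
iqr     = np.percentile(x, 75) - np.percentile(x, 25)
robust  = (x - np.median(x)) / iqr                     # median 0, scaled by IQR

print(z_score, min_max, max_abs, robust, sep="\n")
```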

Clustering Algorithms

  • Partition algorithms
    • K means clustering
    • Gaussian Mixture Model
  • Hierarchical algorithms
    • Bottom-up, agglomerative
    • Top-down, divisive

K means

Place K points (centroids), assign every data point to its nearest centroid, then move each centroid to a position that fits its assigned data (the mean of those points); repeat until the centroids stop moving.
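
A minimal numpy sketch of that loop (random initial centroids and a fixed iteration count are assumptions):

```python
import numpy as np

def kmeans(X, k=2, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # pick K starting points
    for _ in range(iters):
        # Assign every point to its nearest centroid ...
        labels = np.argmin(np.linalg.norm(X[:, None] - centroids[None], axis=2), axis=1)
        # ... then move each centroid to the mean of its assigned points.
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    labels = np.argmin(np.linalg.norm(X[:, None] - centroids[None], axis=2), axis=1)
    return centroids, labels

X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]], dtype=float)
print(kmeans(X, k=2))
```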

K-Medoids

Group points around center points: if a point is close to a group's center it is merged into the group, and if it is far it is removed from the group. Unlike K-means, the center (medoid) is always an actual data point rather than a computed mean.

Hierarchical Clustering (Bottom-up)

Start with every point as its own cluster and repeatedly merge the closest clusters, based on the distance between clusters.

distance of clusters

  • single-link (closest points of the two groups)
  • complete-link (farthest points of the two groups)
  • centroid (distance between the groups' center points)
  • average-link (average distance between all pairs of elements across the two groups)

weighted vs. unweighted

Either weight each group by its number of points, or simply average the two groups' distances.

DBSCAN (Bottom-up)

  1. Start with every point as a group by itself.
  2. If groups a and b fall within range of a “high-density area”, merge them into the same group.
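
A minimal usage sketch, assuming scikit-learn's DBSCAN; eps and min_samples together define the “high-density area”:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus one far-away outlier.
X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8], [50, 50]], dtype=float)

# eps = neighbourhood radius, min_samples = points needed for a high-density area.
db = DBSCAN(eps=2.0, min_samples=3).fit(X)
print(db.labels_)   # cluster ids; -1 marks noise points (the outlier)
```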

pattern mining

FP tree

rules (check whether the itemsets involved are frequent itemsets)

If lift(A, B) > 1, then A and B are positively correlated and the rule is effective.
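
A minimal Python sketch of the support, confidence, and lift calculation behind that rule, on a made-up transaction list:

```python
transactions = [
    {"milk", "bread"}, {"milk", "bread"}, {"milk", "bread"},
    {"milk"}, {"bread"}, {"butter"},
]
n = len(transactions)

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / n

sup_a, sup_b = support({"milk"}), support({"bread"})
sup_ab = support({"milk", "bread"})

confidence = sup_ab / sup_a          # P(B | A)
lift = sup_ab / (sup_a * sup_b)      # > 1 -> A and B are positively correlated
print(sup_ab, confidence, lift)
```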

pattern type

  • Rare Patterns
  • Negative Patterns(Unlikely to happen together)
    • Kulczynski(A, B) less than a threshold
  • Multilevel Associations
  • Multidimensional Associations

constraint types

pruning Strategies

Use different constraint conditions to prune itemsets early.

  • Monotonic:
    • If c is satisfied, no need to check c again
  • Anti-monotonic:
    • If constraint c is violated, its further mining can be terminated
  • Convertible:
    • c can be converted to monotonic or anti-monotonic if items can be properly ordered in processing
  • Succinct:
    • If the constraint c can be enforced by directly manipulating the data

Sequential Patterns

Items within an element are unordered and are listed alphabetically; for example, <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>.
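
A minimal Python sketch of the subsequence check used in that example, representing each sequence as a list of itemsets:

```python
def is_subsequence(sub, seq):
    # Each element is an itemset; sub is contained in seq if its elements can be
    # matched, in order, to elements of seq that are supersets of them.
    i = 0
    for element in seq:
        if i < len(sub) and sub[i] <= element:
            i += 1
    return i == len(sub)

sub = [{"a"}, {"b", "c"}, {"d"}, {"c"}]                        # <a(bc)dc>
seq = [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}]  # <a(abc)(ac)d(cf)>
print(is_subsequence(sub, seq))   # True
```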

Representative algorithms

  • GSP (Generalized Sequential Patterns):
    • Generate “length-(k+1)” candidate sequences from “length-k” frequent sequences using Apriori
  • Vertical format-based mining: SPADE (Zaki@Machine Learning’00)
  • Pattern-growth methods: PrefixSpan

dimensionality reduction

PCA (Principal Component Analysis)

PCoA (Principal Coordinate Analysis)

SVD (Singular Value Decomposition)