data science
5 Vs
- volume
- variety
- value
- velocity
- veracity
data processing chain
- data
- database
- data store
- data mining
- data value
data warehouse
- subject-oriented
- integrated
- non-volatile
- time-variant
data warehouse schema types
- star schema
- snowflake schema
- fact constellation schema
data type
| name | example |
|---|---|
| nominal data (unordered collection) | dog, cat, name |
| ordinal data (ordered data) | level |
| interval data (equal intervals, no true zero, so ratios are not meaningful) | temperature |
| ratio data (true zero, so ratios are meaningful) | float |
| BLOB (binary large object) | image, sound file |
Quartiles
Q1: 25th percentile (lowest 0%~25% of the data)
Q2: 50th percentile / median (25%~50%)
Q3: 75th percentile (50%~75%)
Q4: maximum (75%~100%, the biggest)
interquartile range (IQR): Q3 - Q1
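A minimal sketch of computing the quartile values and the IQR, assuming NumPy and made-up data:

```python
import numpy as np

data = np.array([3, 7, 8, 5, 12, 14, 21, 13, 18])

# quartile values = 25th, 50th, 75th percentiles
q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1  # interquartile range

print(q1, q2, q3, iqr)
```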
confusion matrix
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive (True) | TP | FN |
| Actual Negative (False) | FP | TN |
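A minimal sketch of building the confusion matrix counts from 0/1 label vectors; NumPy and the toy labels are assumptions:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))
fp = np.sum((y_true == 0) & (y_pred == 1))
tn = np.sum((y_true == 0) & (y_pred == 0))

print([[tp, fn], [fp, tn]])  # rows: actual positive / actual negative
```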
ROC curve
correlations
linear regression
similarity functions
| | h1 | h2 | h3 | h4 |
|---|---|---|---|---|
| a | 1 | _ | _ | 0.5 |
| b | 0.5 | _ | 0.3 | _ |
| c | _ | 0.7 | _ | 0.4 |
jaccard similarity
cosine similarity
Pearson correlation coefficient
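Minimal sketches of the three similarity measures, assuming NumPy; Jaccard is computed on sets, the other two on numeric vectors:

```python
import numpy as np

def jaccard(a, b):
    # |A ∩ B| / |A ∪ B| for two sets
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def cosine(x, y):
    # dot(x, y) / (||x|| * ||y||)
    x, y = np.asarray(x, float), np.asarray(y, float)
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

def pearson(x, y):
    # covariance / (std_x * std_y), i.e. cosine of the mean-centered vectors
    x, y = np.asarray(x, float), np.asarray(y, float)
    return cosine(x - x.mean(), y - y.mean())

print(jaccard({1, 2, 3}, {2, 3, 4}))   # 0.5
print(cosine([1, 0, 1], [1, 1, 0]))    # 0.5
print(pearson([1, 2, 3], [2, 4, 6]))   # 1.0
```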
Feature Selection Methods
- Filter methods
- Wrapper methods
- Embedded methods
Classification
Decision Tree
entropy
Information Gain(IG)
If a split reduces the entropy (i.e. the information gain is high), it is a good split point.
Gini index (impurity)
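A minimal sketch of entropy, Gini impurity, and information gain for a split, assuming NumPy and class labels given as lists:

```python
import numpy as np

def entropy(labels):
    # H = -sum(p_i * log2(p_i)) over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # Gini impurity = 1 - sum(p_i^2)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, children):
    # IG = H(parent) - weighted sum of the children's entropies
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

parent = ['yes'] * 5 + ['no'] * 5
split = [['yes'] * 4 + ['no'], ['yes'] + ['no'] * 4]
print(entropy(parent), gini(parent), information_gain(parent, split))
```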
random forest (bootstrap aggregation of decision trees)
uses multiple randomized decision trees and takes a majority vote of their answers
naive bayes
zero-conditional probability (when an estimated conditional probability is zero, smooth it with virtual samples)
t: number of possible values of the condition
n: number of cases
c: number of matching cases
m: number of virtual samples
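The notes don't spell out the smoothing formula, so this is a minimal sketch assuming the common m-estimate (virtual-sample) correction; the function name and example values are made up:

```python
def smoothed_prob(c, n, t, m=1):
    """Estimate P(condition | class) with m virtual samples (assumed m-estimate).

    c: matching case count, n: total cases for the class,
    t: number of possible values of the condition, m: virtual samples.
    With c = 0 the estimate stays above zero instead of collapsing to 0.
    """
    return (c + m * (1.0 / t)) / (n + m)

# a condition never seen with this class still gets a small probability
print(smoothed_prob(c=0, n=10, t=3, m=3))  # ~0.077 instead of 0
```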
Bayesian Network
Bayesian networks are directed and acyclic, whereas Markov networks are undirected and could be cyclic.
SVM(Support Vector Machines)
linearly separable SVM
linearly inseparable SVM
map the data to a higher dimension (kernel)
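A minimal sketch contrasting the two cases with scikit-learn's SVC (a library assumption, not named in the notes): a linear kernel on XOR-like data that is not linearly separable, and an RBF kernel that implicitly maps the data to a higher dimension:

```python
import numpy as np
from sklearn.svm import SVC

# XOR-like data: not linearly separable in 2D
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf", gamma=2.0).fit(X, y)  # kernel maps to higher dimension

print(linear_svm.score(X, y))  # struggles on XOR
print(rbf_svm.score(X, y))     # typically separates XOR
```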
KNN( K Nearest Neighbor)
- The KNN algorithm steps are as follows (a sketch follows after this list):
- The K in KNN means taking the K data points closest to the query point for classification.
- If K is 3, the three closest data points are selected.
- Once these K points are selected, a majority vote over their classes decides which class the query point is assigned to.
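A minimal KNN sketch implementing the steps above, assuming Euclidean distance and NumPy:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    # 1. compute distances from the query to every training point
    dists = np.linalg.norm(X_train - query, axis=1)
    # 2. take the k nearest neighbours
    nearest = np.argsort(dists)[:k]
    # 3. majority vote over their labels
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

X_train = np.array([[1, 1], [1, 2], [5, 5], [6, 5]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([1.5, 1.5]), k=3))  # "A"
```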
Linear Classifiers
Logistic Regression
Gradient Descent
Coordinate Descent
Coordinate descent updates one parameter at a time, while gradient descent attempts to update all parameters at once
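A minimal sketch of logistic regression fit by gradient descent, updating all parameters at once per step; the toy data and learning rate are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy data: one feature, binary labels
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])
Xb = np.hstack([np.ones((len(X), 1)), X])   # add bias column

w = np.zeros(Xb.shape[1])
lr = 0.1                                    # learning rate
for _ in range(1000):
    p = sigmoid(Xb @ w)
    grad = Xb.T @ (p - y) / len(y)          # gradient of the mean cross-entropy loss
    w -= lr * grad                          # update all parameters at once

print(w, (sigmoid(Xb @ w) > 0.5).astype(int))  # weights and predictions
```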
Improve Classification Accuracy
Bagging: bootstrap aggregating
Given a training set D of size n, bagging uniformly samples, with replacement (bootstrap sampling), m subsets from it to use as new training sets. Training a classifier or regression model on each of these m training sets yields m models; averaging their outputs or taking a majority vote gives the bagging result.
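A minimal bagging sketch: train m decision trees on bootstrap samples and combine them by majority vote. scikit-learn's DecisionTreeClassifier and make_classification are assumptions, not named in the notes:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
rng = np.random.default_rng(0)

m = 25                      # number of bootstrap models
models = []
for _ in range(m):
    idx = rng.integers(0, len(X), size=len(X))   # sample with replacement
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# majority vote over the m models
votes = np.array([model.predict(X) for model in models])
bagged = (votes.mean(axis=0) > 0.5).astype(int)
print((bagged == y).mean())   # training accuracy of the ensemble
```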
Boosting
- A series of k classifiers are iteratively learned
- After a classifier Mi is learned, set the subsequent classifier, Mi+1, to pay more attention to the training tuples that were misclassified by Mi
- The final M* combines the votes of each individual classifier, where the weight of each classifier’s vote is a function of its accuracy
Imbalanced Data Sets
- Oversampling: Oversample the minority class.
- Under-sampling: Randomly eliminate tuples from majority class
- Synthesizing: Synthesize new minority-class tuples
- Threshold-moving
- Move the decision threshold, t, so that the rare class tuples are easier to classify, and hence, less chance of costly false negative errors
MLP(ANN)
- Artificial neural unit
- Multi-Layer Perceptron (MLP)
- Activation function
- Back propagation
- Learning rate
- Loss function (MSE mean squared error; Cross Entropy (CE) with the softmax function)
- Weight initialization (Xavier initialization, MSRA initialization)
- Regularization (Dropout)
- training set
- 60% data
- validation set
- 20% data
- testing set
- 20% data
k-fold cross validation
In the K-fold method the data is split into K equal parts, where K is freely chosen. For example, suppose we set K=10, so the training set is cut into ten parts. This means the same model is trained ten times; each training run picks nine of the ten parts as training data, and the remaining part, which did not take part in training, serves as the validation set.
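A minimal K-fold sketch, assuming scikit-learn's KFold and a logistic regression model (neither is named in the notes):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=10, shuffle=True, random_state=0)

scores = []
for train_idx, val_idx in kf.split(X):
    # 9 folds for training, 1 held-out fold for validation
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[val_idx], y[val_idx]))

print(sum(scores) / len(scores))   # average validation accuracy over the 10 folds
```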
Activation Function
| Activation Function | Formula | Range |
|---|---|---|
| Tanh | $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$ | $(-1, 1)$ |
| Sigmoid (Logistic) | $\sigma(x) = \frac{1}{1 + e^{-x}}$ | $(0, 1)$ |
| ReLU | $\max(0, x)$ | $[0, \infty)$ |
| SoftPlus | $\ln(1 + e^x)$ | $(0, \infty)$ |
scale features (normalize)
| Scaling Method | Formula | Range |
|---|---|---|
| Z-Score Normalization | $x' = \frac{x - \mu}{\sigma}$ | unbounded (mean 0, std 1) |
| Min-Max Scaling | $x' = \frac{x - x_{min}}{x_{max} - x_{min}}$ | $[0, 1]$ |
| Maximum Absolute Scaling | $x' = \frac{x}{\max(\lvert x \rvert)}$ | $[-1, 1]$ |
| Robust Scaling | $x' = \frac{x - \mathrm{median}(x)}{IQR(x)}$ | unbounded |
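A minimal sketch of z-score and min-max scaling on a NumPy array; the other two methods follow the same pattern:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 100.0])

z_score = (x - x.mean()) / x.std()                 # mean 0, std 1
min_max = (x - x.min()) / (x.max() - x.min())      # squeezed into [0, 1]

print(z_score)
print(min_max)
```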
Clustering Algorithms
- Partition algorithms
- K means clustering
- Gaussian Mixture Model
- Hierarchical algorithms
- Bottom-up, agglomerative
- Top-down, divisive
K means
place K centroids, assign every point to its nearest centroid, then move each centroid to the mean of its assigned points; repeat until the positions fit the data (stop changing)
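A minimal K-means sketch, assuming Euclidean distance, NumPy, and initialization from randomly chosen data points:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]   # random initial centroids
    for _ in range(iters):
        # assign each point to its nearest centroid
        labels = np.argmin(np.linalg.norm(X[:, None] - centroids[None], axis=2), axis=1)
        # move each centroid to the mean of its assigned points
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return centroids, labels

X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]], dtype=float)
print(kmeans(X, k=2))
```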
K-Medoids
like K-means, but each cluster center (medoid) must be an actual data point: points close to a medoid join its group, and a medoid is swapped for another point when that lowers the total distance within the group
Hierarchical Clustering(Bottom-up)
repeatedly merge the closest pair of clusters, according to a cluster-distance measure
distance of clusters
- single-link (distance between the closest points of the two groups)
- complete-link (distance between the farthest points of the two groups)
- centroid (distance between the center points of the two groups)
- average-link (average distance between all pairs of elements from the two groups)
weighted vs unweighted
either use the number of points in each group as a weight, or simply average the distances of the two groups
DBSCAN (Bottom-up)
- start with every point as its own group
- if groups a and b are within range of a "high-density area" (enough points within the radius), merge them into the same group
pattern mining
FP tree
rules (check whether the itemsets are frequent itemsets)
if lift(A, B) > 1, then A and B are positively correlated (the rule is effective)
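For reference, the standard lift formula (not spelled out in the notes):

$$\mathrm{lift}(A, B) = \frac{\mathrm{sup}(A \cup B)}{\mathrm{sup}(A)\,\mathrm{sup}(B)} = \frac{\mathrm{conf}(A \Rightarrow B)}{\mathrm{sup}(B)}$$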
pattern type
- Rare Patterns
- Negative Patterns(Unlikely to happen together)
- Kulczynski(A,B) Less than threshold
- Multilevel Associations
- Multidimensional Associations
constraint types
pruning strategies
use different conditions to prune the itemsets early
- Monotonic:
- If c is satisfied, no need to check c again
- Anti-monotonic:
- If constraint c is violated, its further mining can be terminated
- Convertible:
- c can be converted to monotonic or anti-monotonic if items can be properly ordered in processing
- Succinct:
- If the constraint c can be enforced by directly manipulating the data
Sequential Patterns
Items within an element are unordered and we list them alphabetically; <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>.
Representative algorithms
- GSP (Generalized Sequential Patterns):
- Generate “length-(k+1)” candidate sequences from “length-k” frequent sequences using Apriori
- Vertical format-based mining: SPADE (Zaki @ Machine Learning'00)
- Pattern-growth methods: PrefixSpan