hw3 110590049

tags: data

2023 Educational Data Mining and Applications HW3.pdf

8.11

8.12

The 10 samples, with each sample's probability also serving as a candidate decision threshold:
| Tuple | Class | Probability |
|-------|-------|-------------|
| 1     | P     | 0.95        |
| 2     | N     | 0.85        |
| 3     | P     | 0.78        |
| 4     | P     | 0.66        |
| 5     | N     | 0.60        |
| 6     | P     | 0.55        |
| 7     | N     | 0.53        |
| 8     | N     | 0.52        |
| 9     | N     | 0.51        |
| 10    | P     | 0.40        |
| Threshold | TP | FP | TN | FN | FPR | TPR |
|-----------|----|----|----|----|-----|-----|
| 0.40      | 5  | 5  | 0  | 0  | 1.0 | 1.0 |
| 0.51      | 4  | 5  | 0  | 1  | 1.0 | 0.8 |
| 0.52      | 4  | 4  | 1  | 1  | 0.8 | 0.8 |
| 0.53      | 4  | 3  | 2  | 1  | 0.6 | 0.8 |
| 0.55      | 4  | 2  | 3  | 1  | 0.4 | 0.8 |
| 0.60      | 3  | 2  | 3  | 2  | 0.4 | 0.6 |
| 0.66      | 3  | 1  | 4  | 2  | 0.2 | 0.6 |
| 0.78      | 2  | 1  | 4  | 3  | 0.2 | 0.4 |
| 0.85      | 1  | 1  | 4  | 4  | 0.2 | 0.2 |
| 0.95      | 1  | 0  | 5  | 4  | 0.0 | 0.2 |
| 1.00      | 0  | 0  | 5  | 5  | 0.0 | 0.0 |

ROC curve: plot of the (FPR, TPR) pairs from the table above, from (0.0, 0.0) up to (1.0, 1.0).
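As a sanity check, the table can be recomputed with a few lines of Python: treat each distinct probability value (plus 1.0) as a decision threshold, predict P whenever a sample's probability is at least the threshold, and count TP/FP/TN/FN. This is only a verification sketch, not part of the required answer.

```python
# Samples from the table above, sorted by decreasing probability
labels = ["P", "N", "P", "P", "N", "P", "N", "N", "N", "P"]
probs = [0.95, 0.85, 0.78, 0.66, 0.60, 0.55, 0.53, 0.52, 0.51, 0.40]

num_p = labels.count("P")  # actual positives
num_n = labels.count("N")  # actual negatives

# Each distinct probability (plus 1.0) serves as a threshold:
# predict P when prob >= threshold.
for t in sorted(set(probs)) + [1.0]:
    tp = sum(1 for l, p in zip(labels, probs) if l == "P" and p >= t)
    fp = sum(1 for l, p in zip(labels, probs) if l == "N" and p >= t)
    fn, tn = num_p - tp, num_n - fp
    print(f"{t:.2f}  TP={tp} FP={fp} TN={tn} FN={fn}  "
          f"FPR={fp / num_n:.1f} TPR={tp / num_p:.1f}")
```

The printed (FPR, TPR) pairs are exactly the points of the ROC curve.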

8.16

Rebalance the training dataset by oversampling the fraudulent (minority) cases or undersampling the non-fraudulent (majority) cases.
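A minimal sketch of random oversampling, assuming the data sits in plain Python lists; the helper name `oversample` is illustrative, and random undersampling would instead drop majority-class samples:

```python
import random

def oversample(X, y, minority_label, seed=42):
    # Duplicate randomly chosen minority-class samples until both classes
    # are the same size (assumes the minority class really is smaller).
    # Undersampling is the mirror image: randomly drop majority samples.
    rng = random.Random(seed)
    minority = [(x, l) for x, l in zip(X, y) if l == minority_label]
    majority = [(x, l) for x, l in zip(X, y) if l != minority_label]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    combined = majority + minority + extra
    rng.shuffle(combined)
    return [x for x, _ in combined], [l for _, l in combined]
```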

Alternatively, apply threshold-moving: shift the decision threshold away from the default 0.5 so the classifier is no longer biased toward the majority (non-fraudulent) class, reducing the chance that fraudulent cases are misclassified.
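A minimal sketch of threshold-moving, assuming the classifier outputs a probability of fraud; the 0.2 cutoff is an illustrative value, not one given by the exercise:

```python
def classify(prob_fraud, threshold=0.2):
    # The default rule would use 0.5; lowering the threshold makes the
    # rare "fraud" class easier to predict, trading more false alarms
    # for fewer missed fraudulent cases.
    return "fraud" if prob_fraud >= threshold else "non-fraud"

print(classify(0.35))  # "fraud" under the moved threshold, "non-fraud" under 0.5
```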

9.4

|              | Eager Classification | Lazy Classification |
|--------------|----------------------|---------------------|
| Advantage    | Better interpretability; better efficiency at classification time | Robust to noise |
| Disadvantage | Needs re-training when new data arrives | Vulnerable to irrelevant features; limited interpretability |

9.5

def distance(a, b):
    # Manhattan (L1) distance between two feature vectors
    return sum(abs(x - y) for x, y in zip(a, b))

def KNN(input_data, k, dataset, answer):
    # Distance from the query point to every training sample, with its label
    distances = []
    for data, label in zip(dataset, answer):
        distances.append({
            "distance": distance(input_data, data),
            "answer": label,
        })
    # Keep the k nearest neighbors
    nearest = sorted(distances, key=lambda x: x["distance"])[:k]
    # Count the class labels among the k neighbors
    counter = {label: 0 for label in answer}
    for x in nearest:
        counter[x["answer"]] += 1
    # Each class's share of the k neighbors is its predicted probability
    return {label: count / k for label, count in counter.items()}

data = [[1, 2, 3], [0, -1, 0], [1, 4, 4], [1, 3, 4]]
answer = ["a", "a", "b", "b"]
input_data = [0, 0, 0]
k = 3
print(KNN(input_data, k, data, answer))  # {'a': 0.666..., 'b': 0.333...}