[Discrete Mathematics] Data Summarization and Probability

Posted Nov 24, 2023

By 엉덩희 7 min read

In this post, we'll look at data summarization and probability as a way into artificial intelligence!

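Mean

The mean is a representative value obtained by adding up all the data points and dividing by how many there are. It is the most commonly used summary, and it is sensitive to outliers.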
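To make the outlier sensitivity concrete, here is a minimal sketch. The score values are hypothetical and chosen only for illustration; they do not come from the post's data.

```python
import numpy as np

# Hypothetical scores, chosen only for illustration.
data = np.array([70, 72, 68, 71, 69])
mean_clean = data.mean()  # (70 + 72 + 68 + 71 + 69) / 5

# Append a single extreme value and recompute the mean.
data_with_outlier = np.append(data, 200)
mean_outlier = data_with_outlier.mean()  # 550 / 6

print(mean_clean)    # 70.0
print(mean_outlier)  # roughly 91.67
```

One extreme value out of six is enough to pull the mean more than 20 points away from where most of the data sits.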


Median

The median is the value that sits in the middle after sorting the data by size.
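As a small sketch of the definition, the example below uses hypothetical values (not from the post). When the number of data points is even there is no single middle value, and `np.median` returns the average of the two middle values.

```python
import numpy as np

# Hypothetical values, chosen only for illustration.
odd = np.array([70, 72, 68, 71, 69])
even = np.array([70, 72, 68, 71, 69, 200])

# Sorted: 68 69 70 71 72 -> the middle value is 70.
median_odd = np.median(odd)

# Sorted: 68 69 70 71 72 200 -> average of the two middle values, 70 and 71.
median_even = np.median(even)

print(median_odd)   # 70.0
print(median_even)  # 70.5
```

Unlike the mean, the median barely moves when the extreme value 200 is added.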


Variance

Variance expresses, as a single number, how far the data spreads out from the mean.

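import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
scores1 = np.random.randint(30, 100, 10)

scores2 = np.random.randint(50, 90, 10)
# scores1 and scores2 are meant to have similar means but different spreads
# (the sample means are not guaranteed to be exactly equal)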

deviations1 = scores1 - scores1.mean()
deviations2 = scores2 - scores2.mean()

fig, ax = plt.subplots(figsize=(10,5), nrows=1, ncols=2, sharey=True)

ax[0].bar(np.arange(10), deviations1, color='C1', edgecolor='k')
ax[0].set_title('scores 1 deviations')
ax[1].bar(np.arange(10), deviations2, color='C2', edgecolor='k')
ax[1].set_title('scores 2 deviations')
plt.show()


Even when two datasets have the same (or nearly the same) mean, their variances can be quite different.

# Compute the variance directly with NumPy
print(scores1.var())
print(scores2.var())
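To connect the deviations to the number that `var()` returns, here is a minimal sketch with hypothetical scores (not the random scores used above). It assumes the default `ddof=0`, which gives the population variance; `ddof=1` gives the sample variance instead.

```python
import numpy as np

# Hypothetical scores, chosen so the arithmetic is easy to follow.
scores = np.array([60, 70, 80, 90, 100])

# Deviations from the mean (80): [-20, -10, 0, 10, 20]
deviations = scores - scores.mean()

# Variance as the mean of the squared deviations: (400+100+0+100+400) / 5
manual_var = np.mean(deviations ** 2)

numpy_var = scores.var()         # default ddof=0, population variance
sample_var = scores.var(ddof=1)  # divides by n - 1 instead: 1000 / 4

print(manual_var, numpy_var, sample_var)  # 200.0 200.0 250.0
```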

Variance can also be pictured as the average area of squares whose side length is each data point's deviation from the mean.

from matplotlib.patches import Rectangle

fig = plt.figure(dpi=100)
ax = plt.axes()

#          0:-   1:+
colors = ['C1', 'C2']

covs = [ Rectangle( (0,0), x, x, edgecolor='k', 
                   facecolor=colors[1], alpha=0.3) 
            for x in deviations2 ]

for cov in covs:
    ax.add_patch(cov)
    
ax.plot(deviations2, np.zeros_like(deviations2), 'o', color='k')
ax.axhline(y=0, color='k')
ax.axvline(x=0, color='k')
ax.axis('equal')
plt.show()


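Notice that the squares show up only in quadrants 1 and 3. Both sides of each square are the same deviation, so a positive deviation lands in quadrant 1 and a negative one in quadrant 3. Every area is therefore non-negative, which is why variance can never be negative. If the two sides came from two different variables, the same picture would show their covariance, and rectangles concentrated in quadrants 1 and 3 would suggest a positive correlation.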

Histogram

A histogram splits the data into classes (bins) and draws the number of data points that fall in each bin as a bar.

np.random.seed(0)
scores = np.abs((np.random.randn(500)*13)-65).astype(int)
scores[scores>=100] = 100

hist, bins = np.histogram(scores, bins=10, range=(0,100))

fig = plt.figure()
ax = plt.axes()

ax.hist(scores, bins=10, range=(0,100), color='C1', edgecolor='k')
ax.set_xlabel('scores')
ax.set_ylabel('# of students per 10 point interval')
ax.set_title('bins=10')
plt.show()


If you pass a larger value for bins=, the range is split into narrower intervals, so the plot becomes more fine-grained.
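The effect of `bins=` can also be checked without plotting. The sketch below uses hypothetical uniformly distributed integer scores (not the post's `scores` array) and compares `np.histogram` with 10 and 20 bins over the same range.

```python
import numpy as np

# Hypothetical integer scores in 0..100, generated only for illustration.
rng = np.random.default_rng(0)
scores = rng.integers(0, 101, size=500)

counts10, edges10 = np.histogram(scores, bins=10, range=(0, 100))
counts20, edges20 = np.histogram(scores, bins=20, range=(0, 100))

# Doubling the number of bins halves the width of each interval.
width10 = edges10[1] - edges10[0]  # 10.0
width20 = edges20[1] - edges20[0]  # 5.0

# The total count is unchanged; the same data is just split more finely.
print(len(counts10), width10, counts10.sum())
print(len(counts20), width20, counts20.sum())
```

More bins show more detail, but each bar holds fewer points, so the shape also becomes noisier.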

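Box plot

A box plot places a lower fence and an upper fence at a fixed distance from the quartiles (here 1.5 times the interquartile range below Q1 and above Q3), so that any data point beyond the fences can be identified as an outlier.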

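D = np.array([1,2,3,4,5,6,7,8,9,10])

# method: how to pick a value when the percentile falls between two data points
# (this option was called `interpolation` before NumPy 1.22)
Q1 = np.percentile(D, 25, method='nearest')
Q2 = np.percentile(D, 50)
Q3 = np.percentile(D, 75, method='nearest')

IQR = Q3 - Q1

# Upper whisker: the largest data point below the upper fence
upper_fence = Q3 + 1.5*IQR
upper_whisker = np.max(D[D<upper_fence])

# Lower whisker: the smallest data point above the lower fence
lower_fence = Q1 - 1.5*IQR
lower_whisker = np.min(D[D>lower_fence])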

fig = plt.figure(dpi=100)
ax = plt.axes()

ret = ax.boxplot(D)

plt.show()
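The array D above has no outliers, so the fences never come into play. As a minimal sketch of how they separate outliers, the example below appends one hypothetical extreme value (30) to the same numbers. It uses NumPy's default linear percentile method rather than 'nearest', and treats points exactly on a fence as inside; both are assumptions of this sketch.

```python
import numpy as np

# Same values as D, plus one hypothetical extreme value (30).
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30])

q1, q3 = np.percentile(data, [25, 75])  # default linear method: 3.5 and 8.5
iqr = q3 - q1                           # 5.0

lower_fence = q1 - 1.5 * iqr  # -4.0
upper_fence = q3 + 1.5 * iqr  # 16.0

# Points beyond a fence are outliers; whiskers end at the most extreme
# points that are still inside the fences.
outliers = data[(data < lower_fence) | (data > upper_fence)]
inside = data[(data >= lower_fence) & (data <= upper_fence)]
lower_whisker, upper_whisker = inside.min(), inside.max()

print(outliers)                      # [30]
print(lower_whisker, upper_whisker)  # 1 10
```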
