01 [Descriptive Statistics + Graphs]
Predicting Wine Quality Grades
from google.colab import files
uploaded = files.upload()
winequality-red.csv
winequality-red.csv(text/csv) - 84199 bytes, last modified: 2023. 1. 7. - 100% done
Saving winequality-red.csv to winequality-red.csv
uploaded = files.upload()
winequality-white.csv
winequality-white.csv(text/csv) - 264426 bytes, last modified: 2023. 1. 7. - 100% done
Saving winequality-white.csv to winequality-white.csv
- Getting Excel to recognize the semicolon column delimiter
import pandas as pd
red_df = pd.read_csv('winequality-red.csv', sep = ';', header = 0, engine = 'python')
white_df = pd.read_csv('winequality-white.csv', sep = ';', header = 0, engine = 'python')
red_df.to_csv('winequality-red2.csv', index = False)
white_df.to_csv('winequality-white2.csv', index = False)
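The step above reads a semicolon-delimited file and writes it back as a regular comma-separated file. A minimal sketch of that round trip, using a made-up two-row sample held in memory (the rows and the reduced column set are assumptions for illustration, not the real data):

```python
import io
import pandas as pd

# Illustrative two-row sample in a semicolon-delimited layout (values are made up).
raw = '"fixed acidity";"pH";"quality"\n7.4;3.51;5\n7.8;3.20;5\n'

# sep=';' tells pandas to split columns on semicolons instead of commas.
df = pd.read_csv(io.StringIO(raw), sep=';', header=0)

# Writing it back with the default settings produces comma-separated text.
out = io.StringIO()
df.to_csv(out, index=False)
csv_text = out.getvalue()
print(csv_text)
```

The re-saved text uses commas, which is what most spreadsheet tools expect by default.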
2. Merging the red wine and white wine files
red_df.head()
red_df.insert(0, column = 'type', value = 'red')
red_df.head()
red_df.shape
(1599, 13)
white_df.head()
white_df.insert(0, column = 'type', value = 'white')
white_df.head()
white_df.shape
(4898, 13)
wine = pd.concat([red_df, white_df])
wine.shape
(6497, 13)
wine.to_csv('wine.csv', index = False)
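Note that `pd.concat` keeps each frame's original row labels, so the merged frame contains repeated index values. A small sketch with toy frames (the data is hypothetical) shows the difference when `ignore_index=True` is passed:

```python
import pandas as pd

# Toy stand-ins for red_df and white_df (hypothetical values).
red = pd.DataFrame({'type': ['red', 'red'], 'quality': [5, 6]})
white = pd.DataFrame({'type': ['white', 'white', 'white'], 'quality': [6, 7, 5]})

kept = pd.concat([red, white])                       # original labels are kept
fresh = pd.concat([red, white], ignore_index=True)   # labels are renumbered from 0

print(kept.index.tolist())   # [0, 1, 0, 1, 2]
print(fresh.index.tolist())  # [0, 1, 2, 3, 4]
```

Positional slicing such as `wine[0:5]` works either way, but label-based lookups with `.loc` can return several rows when the index has duplicates.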
Data exploration
- Checking the basic information
print(wine.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6497 entries, 0 to 4897
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 type 6497 non-null object
1 fixed acidity 6497 non-null float64
2 volatile acidity 6497 non-null float64
3 citric acid 6497 non-null float64
4 residual sugar 6497 non-null float64
5 chlorides 6497 non-null float64
6 free sulfur dioxide 6497 non-null float64
7 total sulfur dioxide 6497 non-null float64
8 density 6497 non-null float64
9 pH 6497 non-null float64
10 sulphates 6497 non-null float64
11 alcohol 6497 non-null float64
12 quality 6497 non-null int64
dtypes: float64(11), int64(1), object(1)
memory usage: 710.6+ KB
None
2. Computing descriptive statistics with functions
wine.columns = wine.columns.str.replace(' ', '_')
wine.head()
wine.describe()
sorted(wine.quality.unique())
[3, 4, 5, 6, 7, 8, 9]
wine.quality.value_counts()
6 2836
5 2138
7 1079
4 216
8 193
3 30
9 5
Name: quality, dtype: int64
Data modeling
1. Comparing groups with the describe() function
wine.groupby('type')['quality'].describe()
wine.groupby('type')['quality'].mean()
type
red 5.636023
white 5.877909
Name: quality, dtype: float64
wine.groupby('type')['quality'].std()
type
red 0.807569
white 0.885639
Name: quality, dtype: float64
wine.groupby('type')['quality'].agg(['mean', 'std'])
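As a possible variation on the `agg(['mean', 'std'])` call, named aggregation lets each output column get an explicit name. This is a sketch on toy data (the values are hypothetical), not part of the original analysis:

```python
import pandas as pd

# Toy data with the same two columns used above (hypothetical values).
toy = pd.DataFrame({
    'type': ['red', 'red', 'white', 'white', 'white'],
    'quality': [5, 6, 6, 7, 5],
})

# Each keyword becomes an output column; the value names the aggregation to apply.
summary = toy.groupby('type')['quality'].agg(mean='mean', std='std', n='count')
print(summary)
```

Including the group size next to the mean and standard deviation makes it easier to judge how much weight each group's statistics deserve.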
Comparing groups with a t-test and regression analysis
pip install statsmodels
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: statsmodels in /usr/local/lib/python3.8/dist-packages (0.12.2)
Requirement already satisfied: pandas>=0.21 in /usr/local/lib/python3.8/dist-packages (from statsmodels) (1.3.5)
Requirement already satisfied: patsy>=0.5 in /usr/local/lib/python3.8/dist-packages (from statsmodels) (0.5.3)
Requirement already satisfied: numpy>=1.15 in /usr/local/lib/python3.8/dist-packages (from statsmodels) (1.21.6)
Requirement already satisfied: scipy>=1.1 in /usr/local/lib/python3.8/dist-packages (from statsmodels) (1.7.3)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.8/dist-packages (from pandas>=0.21->statsmodels) (2.8.2)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.8/dist-packages (from pandas>=0.21->statsmodels) (2022.7)
Requirement already satisfied: six in /usr/local/lib/python3.8/dist-packages (from patsy>=0.5->statsmodels) (1.15.0)
from scipy import stats
from statsmodels.formula.api import ols, glm
red_wine_quality = wine.loc[wine['type'] == 'red', 'quality']
white_wine_quality = wine.loc[wine['type'] == 'white', 'quality']
stats.ttest_ind(red_wine_quality, white_wine_quality, equal_var = False)
Ttest_indResult(statistic=-10.149363059143164, pvalue=8.168348870049682e-24)
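With `equal_var = False`, `stats.ttest_ind` runs Welch's t-test, which does not assume the two groups share a variance. The statistic can be reproduced by hand, as in this sketch on synthetic samples (the sample sizes and distributions are assumptions chosen only for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(5.6, 0.8, size=200)   # synthetic stand-in for one group
b = rng.normal(5.9, 0.9, size=500)   # synthetic stand-in for the other group

# Welch's t: difference in means divided by the unpooled standard error.
se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
t_manual = (a.mean() - b.mean()) / se

result = stats.ttest_ind(a, b, equal_var=False)
t_scipy = result.statistic
print(t_manual, t_scipy)
```

A negative statistic, as in the output above, simply means the first group's mean is lower than the second's.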
Rformula = 'quality ~ fixed_acidity + volatile_acidity + citric_acid + residual_sugar + chlorides + free_sulfur_dioxide + total_sulfur_dioxide + density + pH + sulphates + alcohol'
regression_result = ols(Rformula, data = wine).fit()
regression_result.summary()
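To see how the formula interface maps column names to coefficients, here is a minimal sketch on synthetic data. The frame `toy`, its two predictors, and the true coefficients are assumptions made up for this example:

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(42)
toy = pd.DataFrame({
    'alcohol': rng.uniform(8, 14, 300),
    'pH': rng.uniform(2.8, 3.8, 300),
})
# Synthetic response: depends on alcohol only, plus a little noise.
toy['quality'] = 2.0 + 0.3 * toy['alcohol'] + rng.normal(0, 0.1, 300)

fit = ols('quality ~ alcohol + pH', data=toy).fit()
print(fit.params)    # one coefficient per term, plus the intercept
print(fit.pvalues)

# Predicting for a new row only needs the predictor columns named in the formula.
new_row = pd.DataFrame({'alcohol': [10.0], 'pH': [3.2]})
pred = fit.predict(new_row)
```

The estimated `alcohol` coefficient should land close to the 0.3 used to generate the data, while the `pH` coefficient should be near zero.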
Predicting the quality grade of new samples with the regression model
sample1 = wine[wine.columns.difference(['quality', 'type'])]
sample1 = sample1[0:5][:]
sample1_predict = regression_result.predict(sample1)
sample1_predict
0 4.997607
1 4.924993
2 5.034663
3 5.680333
4 4.997607
dtype: float64
wine[0:5]['quality']
0 5
1 5
2 5
3 6
4 5
Name: quality, dtype: int64
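One simple way to compare the predictions with the actual grades is to round them and measure the average absolute error. The sketch below re-enters the five predicted and actual values printed above by hand; the variable names are only illustrative:

```python
import pandas as pd

# Values copied from the two outputs above.
predicted = pd.Series([4.997607, 4.924993, 5.034663, 5.680333, 4.997607])
actual = pd.Series([5, 5, 5, 6, 5])

rounded = predicted.round().astype(int)        # nearest integer grade
mae = (predicted - actual).abs().mean()        # mean absolute error

print(rounded.tolist())
print(mae)
```

Five rows taken from the training data are not a fair test of the model, so this is only a sanity check.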
data = {"fixed_acidity": [8.5, 8.1], "volatile_acidity":[0.8, 0.5], "citric_acid":[0.3, 0.4], "residual_sugar":[6.1, 5.8],
"chlorides":[0.055, 0.04], "free_sulfur_dioxide":[30.0, 31.0], "total_sulfur_dioxide":[98.0, 99], "density":[0.996, 0.91],
"pH":[3.25, 3.01], "sulphates":[0.4, 0.35], "alcohol":[9.0, 0.88]}
sample2 = pd.DataFrame(data, columns=sample1.columns)
sample2
sample2_predict = regression_result.predict(sample2)
sample2_predict
0 4.809094
1 7.582129
dtype: float64
Visualizing the results
Drawing a histogram of quality grades by wine type
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('dark')
sns.distplot(red_wine_quality, kde = True, color = "red", label = 'red wine')
/usr/local/lib/python3.8/dist-packages/seaborn/distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
warnings.warn(msg, FutureWarning)
<matplotlib.axes._subplots.AxesSubplot at 0x7f9b26764400>
sns.distplot(white_wine_quality, kde = True, label = 'white wine')
plt.title("Quality of Wine Type")
plt.legend()
plt.show()
/usr/local/lib/python3.8/dist-packages/seaborn/distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
warnings.warn(msg, FutureWarning)
Visualizing with partial regression plots
import statsmodels.api as sm
others = list(set(wine.columns).difference(set(["quality", "fixed_acidity"])))
p, resids = sm.graphics.plot_partregress("quality", "fixed_acidity", others, data = wine, ret_coords = True)
plt.show()
fig = plt.figure(figsize = (8, 13))
sm.graphics.plot_partregress_grid(regression_result, fig = fig)
plt.show()
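A partial regression plot shows the response against one predictor after the influence of all other predictors has been removed from both. The sketch below illustrates the idea with NumPy on synthetic data (the variables `x1`, `x2`, and `y` are made up): the slope between the two sets of residuals equals that predictor's coefficient in the full regression.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)              # predictor of interest
x2 = 0.5 * x1 + rng.normal(size=n)   # another, correlated predictor
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(scale=0.5, size=n)

def residuals(target, design):
    """Residuals of a least-squares fit of target on the design matrix."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ coef

others = np.column_stack([np.ones(n), x2])   # intercept plus the remaining predictor
y_res = residuals(y, others)                 # response with the others removed
x_res = residuals(x1, others)                # x1 with the others removed

# Slope of the line drawn in a partial regression plot.
slope_partial = (x_res @ y_res) / (x_res @ x_res)

# Coefficient of x1 in the full multiple regression, for comparison.
full = np.column_stack([np.ones(n), x1, x2])
coef_full, *_ = np.linalg.lstsq(full, y, rcond=None)
print(slope_partial, coef_full[1])
```

This is why each panel of the grid can be read as the effect of one variable with the others held fixed.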
02. [Correlation Analysis + Heatmap]
Analyzing Titanic survival rates
Data collection
import seaborn as sns
import pandas as pd
titanic = sns.load_dataset("titanic")
titanic.to_csv('titanic.csv', index = False)
Data preparation
titanic.isnull().sum()
survived 0
pclass 0
sex 0
age 177
sibsp 0
parch 0
fare 0
embarked 2
class 0
who 0
adult_male 0
deck 688
embark_town 2
alive 0
alone 0
dtype: int64
titanic['age'] = titanic['age'].fillna(titanic['age'].median())
titanic['embarked'].value_counts()
S 644
C 168
Q 77
Name: embarked, dtype: int64
titanic['embarked'] = titanic['embarked'].fillna('S')
titanic['embark_town'].value_counts()
Southampton 644
Cherbourg 168
Queenstown 77
Name: embark_town, dtype: int64
titanic['embark_town'] = titanic['embark_town'].fillna('Southampton')
titanic['deck'].value_counts()
C 59
B 47
D 33
E 32
A 15
F 13
G 4
Name: deck, dtype: int64
titanic['deck'] = titanic['deck'].fillna('C')
titanic.isnull().sum()
survived 0
pclass 0
sex 0
age 0
sibsp 0
parch 0
fare 0
embarked 0
class 0
who 0
adult_male 0
deck 0
embark_town 0
alive 0
alone 0
dtype: int64
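The fill values above ('S', 'Southampton', 'C') were read off the `value_counts()` output by eye. As a sketch of an alternative, the most frequent value can be looked up with `mode()`; the toy frame here is hypothetical:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column (hypothetical data).
toy = pd.DataFrame({
    'age': [22.0, np.nan, 35.0, 58.0],
    'embarked': ['S', 'C', None, 'S'],
})

# Numeric column: fill with the median of the observed values.
toy['age'] = toy['age'].fillna(toy['age'].median())
# Categorical column: fill with the most frequent value.
toy['embarked'] = toy['embarked'].fillna(toy['embarked'].mode()[0])

print(toy)
print(toy.isnull().sum())
```

Computing the fill value keeps the code correct if the data changes, at the cost of being less explicit about which value is used.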
Data exploration
titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 survived 891 non-null int64
1 pclass 891 non-null int64
2 sex 891 non-null object
3 age 891 non-null float64
4 sibsp 891 non-null int64
5 parch 891 non-null int64
6 fare 891 non-null float64
7 embarked 891 non-null object
8 class 891 non-null category
9 who 891 non-null object
10 adult_male 891 non-null bool
11 deck 891 non-null category
12 embark_town 891 non-null object
13 alive 891 non-null object
14 alone 891 non-null bool
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB
titanic.survived.value_counts()
0 549
1 342
Name: survived, dtype: int64
import matplotlib.pyplot as plt
f, ax = plt.subplots(1, 2, figsize = (10, 5))
titanic['survived'][titanic['sex'] == 'male'].value_counts().plot.pie(explode = [0,0.1], autopct = '%1.1f%%', ax = ax[0], shadow = True)
titanic['survived'][titanic['sex'] == 'female'].value_counts().plot.pie(explode = [0,0.1], autopct = '%1.1f%%', ax = ax[1], shadow = True)
ax[0].set_title('Survived (Male)')
ax[1].set_title('Survived (Female)')
plt.show()
# Pass the column as the x keyword; positional use is rejected from seaborn 0.12 onward.
sns.countplot(x = 'pclass', hue = 'survived', data = titanic)
plt.title('Pclass vs Survived')
plt.show()
Data modeling
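# Keep only numeric and boolean columns; pandas 2.x raises an error if text columns are included.
titanic_corr = titanic.select_dtypes(include = ['number', 'bool']).corr(method = 'pearson')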
titanic_corr
# Keep the index so the saved matrix retains its row labels.
titanic_corr.to_csv('titanic_corr.csv', index = True)
titanic['survived'].corr(titanic['adult_male'])
-0.5570800422053257
titanic['survived'].corr(titanic['fare'])
0.2573065223849625
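The Pearson coefficient returned by `corr` is the covariance of the two variables divided by the product of their standard deviations. A sketch that computes it by hand on a few made-up values and compares it with pandas:

```python
import numpy as np
import pandas as pd

# Six made-up passengers: survival flag and fare paid (hypothetical values).
survived = pd.Series([0, 1, 1, 0, 1, 0])
fare = pd.Series([7.25, 71.28, 7.92, 8.05, 53.10, 8.46])

# Deviations from the mean for each variable.
dx = survived - survived.mean()
dy = fare - fare.mean()

# Pearson r: sum of co-deviations over the root of the product of squared deviations.
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
r_pandas = survived.corr(fare)
print(r_manual, r_pandas)
```

The sign gives the direction of the relationship and the magnitude its strength, which is how the values -0.557 and 0.257 above can be read.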
Visualizing the results
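# Passing the full DataFrame raised "ValueError: object arrays are not supported" in the original run,
# so the pair plot is restricted to numeric columns here.
sns.pairplot(titanic[['survived', 'pclass', 'age', 'sibsp', 'parch', 'fare']], hue = 'survived')
plt.show()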
sns.catplot(x = 'pclass', y = 'survived', hue = 'sex', data = titanic, kind = 'point')
plt.show()
def category_age(x):
    if x < 10:
        return 0
    elif x < 20:
        return 1
    elif x < 30:
        return 2
    elif x < 40:
        return 3
    elif x < 50:
        return 4
    elif x < 60:
        return 5
    elif x < 70:
        return 6
    else:
        return 7
titanic['age2'] = titanic['age'].apply(category_age)
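The same ten-year binning done by `category_age` could also be expressed with `pd.cut`. This is a sketch of an alternative on a few sample ages (made up for illustration), not what the post itself uses:

```python
import numpy as np
import pandas as pd

ages = pd.Series([4.0, 10.0, 19.9, 28.0, 45.0, 69.0, 70.0, 80.0])

# Left-closed bins [lower, upper) match the "x < 10", "x < 20", ... checks in category_age.
edges = [-np.inf, 10, 20, 30, 40, 50, 60, 70, np.inf]
age_group = pd.cut(ages, bins=edges, right=False, labels=False)

print(age_group.tolist())  # [0, 1, 1, 2, 4, 6, 7, 7]
```

`labels=False` returns the integer bin number directly, which is the same coding as the 0 to 7 values returned by the function.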
titanic['sex'] = titanic['sex'].map({'male':1, 'female':0})
titanic['family'] = titanic['sibsp'] + titanic['parch'] + 1
titanic.to_csv('titanic3.csv', index = False)
heatmap_data = titanic[['survived', 'sex', 'age2', 'family', 'pclass', 'fare']]
colormap = plt.cm.RdBu
sns.heatmap(heatmap_data.astype(float).corr(), linewidths = 0.1, vmax = 1.0, square = True, cmap = colormap, linecolor = 'white', annot = True, annot_kws = {"size": 10})
plt.show()