Data Science-Based Python Big Data Analysis, Chapter 07: Statistical Analysis

2023. 1. 8. 23:42 | Python / Data Science-Based Python Big Data Analysis (Hanbit Academy)

01 [Descriptive Statistics + Graphs]

Predicting Wine Quality Grades

from google.colab import files
uploaded = files.upload()
winequality-red.csv
winequality-red.csv(text/csv) - 84199 bytes, last modified: 2023. 1. 7. - 100% done
Saving winequality-red.csv to winequality-red.csv

uploaded = files.upload()
winequality-white.csv
winequality-white.csv(text/csv) - 264426 bytes, last modified: 2023. 1. 7. - 100% done
Saving winequality-white.csv to winequality-white.csv
  1. Reading the semicolon-delimited files and re-saving them as comma-separated CSV so that Excel recognizes the columns
import pandas as pd
red_df = pd.read_csv('winequality-red.csv', sep = ';', header = 0, engine = 'python')
white_df = pd.read_csv('winequality-white.csv', sep = ';', header = 0, engine = 'python')
red_df.to_csv('winequality-red2.csv', index = False)
white_df.to_csv('winequality-white2.csv', index = False)
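
The step above reads a semicolon-delimited file and writes it back with the default comma delimiter of pandas. A minimal, self-contained sketch of the same round trip follows, using an in-memory string in place of the real file (the two sample rows are made up for illustration):

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same semicolon-delimited layout.
raw = '"fixed acidity";"pH";"quality"\n7.4;3.51;5\n7.8;3.2;5\n'

# sep=';' tells pandas to split columns on semicolons.
df = pd.read_csv(io.StringIO(raw), sep=';', header=0)

# to_csv writes commas by default, which spreadsheet tools usually split on.
out = io.StringIO()
df.to_csv(out, index=False)
csv_text = out.getvalue()
print(csv_text)
```

Whether a spreadsheet program splits on commas can depend on its locale settings, so this is a sketch of the idea rather than a guarantee for every setup.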

  2. Merging the red wine and white wine files

red_df.head()
 

red_df.insert(0, column = 'type', value = 'red')
red_df.head()

red_df.shape
(1599, 13)

white_df.head()

white_df.insert(0, column = 'type', value = 'white')
white_df.head()

white_df.shape
(4898, 13)

wine = pd.concat([red_df, white_df])
wine.shape
(6497, 13)

wine.to_csv('wine.csv', index = False)
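
By default pd.concat keeps each frame's original row labels, so the combined frame can contain repeated labels. A small sketch with made-up frames (stand-ins for red_df and white_df) shows the difference that ignore_index = True makes; whether to renumber is a design choice, not something the original code requires:

```python
import pandas as pd

# Made-up stand-ins for red_df and white_df.
red = pd.DataFrame({'quality': [5, 6]})
white = pd.DataFrame({'quality': [6, 7, 5]})
red.insert(0, column='type', value='red')
white.insert(0, column='type', value='white')

# Default: original labels are kept, so 0 and 1 appear twice.
kept = pd.concat([red, white])

# ignore_index=True renumbers the rows 0..n-1.
renumbered = pd.concat([red, white], ignore_index=True)

print(list(kept.index))
print(list(renumbered.index))
```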

Data Exploration

  1. Checking basic information
print(wine.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6497 entries, 0 to 4897
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   type                  6497 non-null   object 
 1   fixed acidity         6497 non-null   float64
 2   volatile acidity      6497 non-null   float64
 3   citric acid           6497 non-null   float64
 4   residual sugar        6497 non-null   float64
 5   chlorides             6497 non-null   float64
 6   free sulfur dioxide   6497 non-null   float64
 7   total sulfur dioxide  6497 non-null   float64
 8   density               6497 non-null   float64
 9   pH                    6497 non-null   float64
 10  sulphates             6497 non-null   float64
 11  alcohol               6497 non-null   float64
 12  quality               6497 non-null   int64  
dtypes: float64(11), int64(1), object(1)
memory usage: 710.6+ KB
None

  2. Computing descriptive statistics with functions

wine.columns = wine.columns.str.replace(' ', '_')
wine.head()

wine.describe()

sorted(wine.quality.unique())
[3, 4, 5, 6, 7, 8, 9]

wine.quality.value_counts()
6    2836
5    2138
7    1079
4     216
8     193
3      30
9       5
Name: quality, dtype: int64

Data Modeling

  1. Comparing groups with the describe() function

wine.groupby('type')['quality'].describe()

wine.groupby('type')['quality'].mean()
type
red      5.636023
white    5.877909
Name: quality, dtype: float64

wine.groupby('type')['quality'].std()
type
red      0.807569
white    0.885639
Name: quality, dtype: float64

wine.groupby('type')['quality'].agg(['mean', 'std'])

Comparing groups with a t-test and regression analysis

pip install statsmodels

from scipy import stats
from statsmodels.formula.api import ols, glm
red_wine_quality = wine.loc[wine['type'] == 'red', 'quality']
white_wine_quality = wine.loc[wine['type'] == 'white', 'quality']
stats.ttest_ind(red_wine_quality, white_wine_quality, equal_var = False)
Ttest_indResult(statistic=-10.149363059143164, pvalue=8.168348870049682e-24)
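
With equal_var = False, ttest_ind performs Welch's t-test, which does not pool the two variances. As a cross-check, the statistic can be recomputed by hand from the group means and standard deviations printed earlier and the row counts from .shape. This is a sketch using only those rounded summary values, so small rounding differences are expected:

```python
import math

# Group summaries reported above (mean and std from groupby, n from .shape).
mean_red, std_red, n_red = 5.636023, 0.807569, 1599
mean_white, std_white, n_white = 5.877909, 0.885639, 4898

# Welch's t statistic: difference in means over the unpooled standard error.
v_red = std_red**2 / n_red
v_white = std_white**2 / n_white
t_stat = (mean_red - mean_white) / math.sqrt(v_red + v_white)

# Welch-Satterthwaite approximation of the degrees of freedom.
dof = (v_red + v_white)**2 / (v_red**2 / (n_red - 1) + v_white**2 / (n_white - 1))

print(round(t_stat, 3), round(dof, 1))
```

The hand-computed statistic should agree with the value reported by stats.ttest_ind to about three decimal places.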

Rformula = 'quality ~ fixed_acidity + volatile_acidity + citric_acid + residual_sugar + chlorides + free_sulfur_dioxide + total_sulfur_dioxide + density + pH + sulphates + alcohol'
regression_result = ols(Rformula, data = wine).fit()
regression_result.summary()
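
The formula passed to ols describes a linear model with an intercept plus one coefficient per listed column, fitted by least squares. As a rough illustration of that idea only, the sketch below uses numpy.linalg.lstsq on made-up, noise-free data; it is not statsmodels and does not use the wine data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up predictors and a noise-free target: y = 1.0 + 2.0*x1 - 0.5*x2.
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
y = 1.0 + 2.0 * x1 - 0.5 * x2

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares solution: minimizes the sum of squared residuals.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)
```

Because the target has no noise, the fitted coefficients recover the intercept and slopes used to build it; with real data such as the wine table they are estimates with standard errors, which is what summary() reports.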

Predicting quality grades of new samples with the regression model

sample1 = wine[wine.columns.difference(['quality', 'type'])]
sample1 = sample1[0:5][:]
sample1_predict = regression_result.predict(sample1)
sample1_predict
0    4.997607
1    4.924993
2    5.034663
3    5.680333
4    4.997607
dtype: float64

wine[0:5]['quality']
0    5
1    5
2    5
3    6
4    5
Name: quality, dtype: int64

data = {"fixed_acidity": [8.5, 8.1], "volatile_acidity":[0.8, 0.5], "citric_acid":[0.3, 0.4], "residual_sugar":[6.1, 5.8],
        "chlorides":[0.055, 0.04], "free_sulfur_dioxide":[30.0, 31.0], "total_sulfur_dioxide":[98.0, 99], "density":[0.996, 0.91],
        "pH":[3.25, 3.01], "sulphates":[0.4, 0.35], "alcohol":[9.0, 0.88]}
sample2 = pd.DataFrame(data, columns=sample1.columns)
sample2

sample2_predict = regression_result.predict(sample2)
sample2_predict
0    4.809094
1    7.582129
dtype: float64
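
A linear model extrapolates without complaint, so it can help to check whether a new sample lies inside the range of the training data before trusting its prediction. A minimal sketch with made-up numbers follows; the helper name out_of_range is hypothetical and not part of pandas or statsmodels:

```python
import pandas as pd

def out_of_range(train: pd.DataFrame, new: pd.DataFrame) -> pd.DataFrame:
    """Return a boolean frame: True where a new value falls outside
    the [min, max] observed in the training data for that column."""
    cols = list(new.columns)
    lo, hi = train[cols].min(), train[cols].max()
    # axis=1 aligns the min/max Series with the column labels of `new`.
    return new.lt(lo, axis=1) | new.gt(hi, axis=1)

# Made-up training data and new samples for illustration.
train = pd.DataFrame({'alcohol': [8.0, 10.5, 14.0], 'pH': [2.9, 3.2, 3.8]})
new = pd.DataFrame({'alcohol': [9.0, 0.88], 'pH': [3.25, 3.01]})

flags = out_of_range(train, new)
print(flags)
```

With the data in this post, the equivalent call would be out_of_range(wine, sample2), which flags any column where a sample falls outside the values seen in wine.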

Visualizing the Results

Drawing histograms of quality grade by wine type

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('dark')
# distplot is deprecated; histplot with stat = 'density' draws a comparable
# normalized histogram with a KDE curve on the current axes.
sns.histplot(red_wine_quality, kde = True, stat = 'density', color = "red", label = 'red wine')
sns.histplot(white_wine_quality, kde = True, stat = 'density', label = 'white wine')
plt.title("Quality of Wine Type")
plt.legend()
plt.show()

Visualizing with partial regression plots

import statsmodels.api as sm
others = list(set(wine.columns).difference(set(["quality", "fixed_acidity"])))
p, resids = sm.graphics.plot_partregress("quality", "fixed_acidity", others, data = wine, ret_coords = True)
plt.show()
fig = plt.figure(figsize = (8, 13))
sm.graphics.plot_partregress_grid(regression_result, fig = fig)
plt.show()
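
A partial regression plot shows the relationship between the response and one predictor after the other predictors have been accounted for: both are regressed on the remaining predictors, and the two sets of residuals are plotted against each other. The slope through those residuals equals that predictor's coefficient in the full model. A small numpy sketch on made-up data (not the wine data) checks this:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Made-up data: x is correlated with z, and y depends on both.
z = rng.normal(size=n)
x = 0.6 * z + rng.normal(size=n)
y = 2.0 + 1.5 * x - 0.8 * z + rng.normal(scale=0.3, size=n)

def residuals(target, design):
    """Residuals of a least-squares fit of target on design."""
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ beta

ones = np.ones(n)
others = np.column_stack([ones, z])   # intercept + the other predictor
full = np.column_stack([ones, x, z])  # full model

# Coefficient of x in the full multiple regression.
full_beta, *_ = np.linalg.lstsq(full, y, rcond=None)

# Slope of residual(y | others) on residual(x | others).
ry, rx = residuals(y, others), residuals(x, others)
partial_slope = (rx @ ry) / (rx @ rx)

print(full_beta[1], partial_slope)
```

The two printed numbers should match up to floating-point error, which is why the slope in each panel of the grid can be read as that variable's regression coefficient.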

02 [Correlation Analysis + Heatmap]

Analyzing Titanic Survival Rates

Data Collection

import seaborn as sns
import pandas as pd
titanic = sns.load_dataset("titanic")
titanic.to_csv('titanic.csv', index = False)

Data Preparation

titanic.isnull().sum()
survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

titanic['age'] = titanic['age'].fillna(titanic['age'].median())
titanic['embarked'].value_counts()
S    644
C    168
Q     77
Name: embarked, dtype: int64

titanic['embarked'] = titanic['embarked'].fillna('S')
titanic['embark_town'].value_counts()
Southampton    644
Cherbourg      168
Queenstown      77
Name: embark_town, dtype: int64

titanic['embark_town'] = titanic['embark_town'].fillna('Southampton')
titanic['deck'].value_counts()
C    59
B    47
D    33
E    32
A    15
F    13
G     4
Name: deck, dtype: int64

titanic['deck'] = titanic['deck'].fillna('C')
titanic.isnull().sum()
survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
deck           0
embark_town    0
alive          0
alone          0
dtype: int64
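
The preparation step fills numeric gaps with the median and categorical gaps with the most frequent value. A compact sketch of the same pattern on a made-up frame follows, using Series.mode() to find the most frequent value instead of reading it off value_counts() by eye:

```python
import numpy as np
import pandas as pd

# Made-up frame with missing values in a numeric and a categorical column.
df = pd.DataFrame({
    'age': [22.0, np.nan, 30.0, 40.0],
    'embarked': ['S', 'C', None, 'S'],
})

# Numeric column: fill with the median of the observed values.
df['age'] = df['age'].fillna(df['age'].median())

# Categorical column: fill with the most frequent observed value.
most_common = df['embarked'].mode()[0]
df['embarked'] = df['embarked'].fillna(most_common)

print(df)
```

Filling is a judgment call. In the titanic data, deck is missing for 688 of 891 rows, so after filling, that column mostly reflects the imputed value rather than observed data.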

Data Exploration

titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          891 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     891 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         891 non-null    category
 12  embark_town  891 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB

titanic.survived.value_counts()
0    549
1    342
Name: survived, dtype: int64

import matplotlib.pyplot as plt
f, ax = plt.subplots(1, 2, figsize = (10, 5))
titanic['survived'][titanic['sex'] == 'male'].value_counts().plot.pie(explode = [0,0.1], autopct = '%1.1f%%', ax = ax[0], shadow = True)
titanic['survived'][titanic['sex'] == 'female'].value_counts().plot.pie(explode = [0,0.1], autopct = '%1.1f%%', ax = ax[1], shadow = True)
ax[0].set_title('Survived (Male)')
ax[1].set_title('Survived (Female)')
plt.show()
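
The pie charts show the share of survivors within each sex. The same proportions can be read as numbers with a groupby or a normalized crosstab. A sketch on a made-up frame follows (the column names mirror the titanic dataset, the rows are invented):

```python
import pandas as pd

# Invented rows; only the column names mirror the titanic dataset.
df = pd.DataFrame({
    'sex': ['male', 'male', 'male', 'male', 'female', 'female'],
    'survived': [0, 0, 0, 1, 1, 0],
})

# The mean of a 0/1 column is the share of 1s, i.e. the survival rate.
rate_by_sex = df.groupby('sex')['survived'].mean()

# Row-normalized table: each row sums to 1 across the survived values.
share_table = pd.crosstab(df['sex'], df['survived'], normalize='index')

print(rate_by_sex)
print(share_table)
```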

sns.countplot(x = 'pclass', hue = 'survived', data = titanic)
plt.title('Pclass vs Survived')
plt.show()

Data Modeling

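# Select numeric and boolean columns explicitly; newer pandas versions raise an error on text columns.
titanic_corr = titanic.select_dtypes(include = ['number', 'bool']).corr(method = 'pearson')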
titanic_corr

titanic_corr.to_csv('titanic_corr.csv')  # keep the index so each row keeps its variable name

titanic['survived'].corr(titanic['adult_male'])
-0.5570800422053257

titanic['survived'].corr(titanic['fare'])
0.2573065223849625
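
Pearson's coefficient is the covariance of two variables divided by the product of their standard deviations. A short sketch on made-up values computes it directly and compares it with Series.corr:

```python
import numpy as np
import pandas as pd

# Made-up values for illustration.
a = pd.Series([0, 1, 0, 1, 1, 0], dtype=float)
b = pd.Series([7.2, 71.3, 8.0, 53.1, 26.5, 8.5])

# r = sum(da * db) / sqrt(sum(da^2) * sum(db^2)), with da, db the deviations from the mean.
da, db = a - a.mean(), b - b.mean()
r_manual = (da * db).sum() / np.sqrt((da**2).sum() * (db**2).sum())

r_pandas = a.corr(b)  # method='pearson' is the default
print(r_manual, r_pandas)
```

The coefficient always lies between -1 and 1; its sign gives the direction of the linear relationship and its magnitude the strength.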

Visualizing the Results

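# Passing the full frame raised "ValueError: object arrays are not supported";
# restricting the plot to numeric columns is one likely workaround.
sns.pairplot(titanic[['survived', 'pclass', 'age', 'sibsp', 'parch', 'fare']], hue = 'survived')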

sns.catplot(x = 'pclass', y = 'survived', hue = 'sex', data = titanic, kind = 'point')
plt.show()

def category_age(x):
    if x < 10:
        return 0
    elif x < 20:
        return 1
    elif x < 30:
        return 2
    elif x < 40:
        return 3
    elif x < 50:
        return 4
    elif x < 60:
        return 5
    elif x < 70:
        return 6
    else:
        return 7
titanic['age2'] = titanic['age'].apply(category_age)
titanic['sex'] = titanic['sex'].map({'male':1, 'female':0})
titanic['family'] = titanic['sibsp'] + titanic['parch'] + 1
titanic.to_csv('titanic3.csv', index = False)
heatmap_data = titanic[['survived', 'sex', 'age2', 'family', 'pclass', 'fare']]
colormap = plt.cm.RdBu
sns.heatmap(heatmap_data.astype(float).corr(), linewidths = 0.1, vmax = 1.0, square = True, cmap = colormap, linecolor = 'white', annot = True, annot_kws = {"size": 10})
plt.show()
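
The category_age function maps ages to ten-year bands. The same binning can be expressed with pd.cut. The sketch below, on made-up ages, checks that the two agree; the one-line category_age here is a compact stand-in for the if/elif chain and assumes non-negative ages:

```python
import numpy as np
import pandas as pd

def category_age(x):
    """Ten-year age band: 0 for <10, 1 for <20, ..., 7 for 70 and over.
    Compact stand-in for the if/elif chain; assumes x >= 0."""
    return min(int(x // 10), 7)

# Made-up ages covering the band edges.
ages = pd.Series([0.42, 9.0, 10.0, 28.0, 39.9, 64.0, 70.0, 80.0])

# right=False makes the bins left-closed: [0,10), [10,20), ..., [70, inf).
bins = [0, 10, 20, 30, 40, 50, 60, 70, np.inf]
age_band = pd.cut(ages, bins=bins, right=False, labels=False)

print(age_band.tolist())
```

pd.cut is vectorized and keeps the bin edges in one list, which can be easier to adjust than a chain of conditions; the apply version is equally valid for a dataset of this size.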