2023. 4. 8. 23:41 · BOOTCAMP/Programmers AI Dev Course
Choosing a dataset
Brazilian E-Commerce Public Dataset by Olist
https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce
100,000 Orders with product, customer and reviews info
Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Loading the data
customers_df = pd.read_csv('/Users/Desktop/archive/olist_customers_dataset.csv')
sellers_df = pd.read_csv('/Users/Desktop/archive/olist_sellers_dataset.csv')
review_df = pd.read_csv('/Users/Desktop/archive/olist_order_reviews_dataset.csv')
items_df = pd.read_csv('/Users/Desktop/archive/olist_order_items_dataset.csv')
products_df = pd.read_csv('/Users/Desktop/archive/olist_products_dataset.csv')
geolocation_df = pd.read_csv('/Users/Desktop/archive/olist_geolocation_dataset.csv')
category_df = pd.read_csv('/Users/Desktop/archive/product_category_name_translation.csv')
orders_df = pd.read_csv('/Users/Desktop/archive/olist_orders_dataset.csv')
payments_df = pd.read_csv('/Users/Desktop/archive/olist_order_payments_dataset.csv')
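The nine `read_csv` calls above differ only in the file name, so they could be driven from a single mapping. The following is a minimal sketch, assuming all CSV files sit in one directory; `load_tables`, `OLIST_FILES`, and the short table keys are hypothetical names introduced here, not part of the original notebook.

```python
from pathlib import Path

import pandas as pd

# File names as used in the read_csv calls above, keyed by a short table name.
# The keys are illustrative choices, not names defined by the dataset.
OLIST_FILES = {
    "customers": "olist_customers_dataset.csv",
    "sellers": "olist_sellers_dataset.csv",
    "reviews": "olist_order_reviews_dataset.csv",
    "items": "olist_order_items_dataset.csv",
    "products": "olist_products_dataset.csv",
    "geolocation": "olist_geolocation_dataset.csv",
    "category": "product_category_name_translation.csv",
    "orders": "olist_orders_dataset.csv",
    "payments": "olist_order_payments_dataset.csv",
}


def load_tables(data_dir, files=OLIST_FILES):
    """Read each listed CSV under data_dir into a dict of DataFrames."""
    data_dir = Path(data_dir)
    return {name: pd.read_csv(data_dir / filename) for name, filename in files.items()}
```

With the directory used above, `tables = load_tables('/Users/Desktop/archive')` would make `tables['customers']` play the role of `customers_df`.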
EDA
- customers_df
customers_df.head()
customers_df.describe()
![](https://blog.kakaocdn.net/dn/lgvCi/btr8T5yluSO/wCVN60XklH3h9TIKUBNg1k/img.png)
customers_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99441 entries, 0 to 99440
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 customer_id 99441 non-null object
1 customer_unique_id 99441 non-null object
2 customer_zip_code_prefix 99441 non-null int64
3 customer_city 99441 non-null object
4 customer_state 99441 non-null object
dtypes: int64(1), object(4)
memory usage: 3.8+ MB
customers_df['customer_city'].unique()
array(['franca', 'sao bernardo do campo', 'sao paulo',...,
'monte bonito', 'sao rafael', 'eugenio de castro'], dtype=object)
customers_df.count()
customer_id 99441
customer_unique_id 99441
customer_zip_code_prefix 99441
customer_city 99441
customer_state 99441
dtype: int64
customers_df.isna().sum()
customer_id 0
customer_unique_id 0
customer_zip_code_prefix 0
customer_city 0
customer_state 0
dtype: int64
- sellers_df
sellers_df.head()
sellers_df.describe()
sellers_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3095 entries, 0 to 3094
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 seller_id 3095 non-null object
1 seller_zip_code_prefix 3095 non-null int64
2 seller_city 3095 non-null object
3 seller_state 3095 non-null object
dtypes: int64(1), object(3)
memory usage: 96.8+ KB
sellers_df['seller_city'].unique()
array(['campinas', 'mogi guacu', 'rio de janeiro', 'sao paulo',
'braganca paulista', 'brejao', 'penapolis', 'curitiba', 'anapolis',
'itirapina', 'tubarao', 'lauro de freitas', 'imbituba', 'brasilia',
'porto seguro', 'guaruja', 'tabatinga', 'salto', 'tres de maio',
'belo horizonte', 'arapongas', 'sao bernardo do campo', 'tatui',
'garopaba', 'camanducaia', 'tupa', 'guarulhos',
'sao jose dos pinhais', 'sao ludgero', 'sao jose', 'piracicaba',
'porto alegre', 'congonhal', 'santo andre', 'osasco', 'valinhos',
'joinville', 'saquarema', 'barra velha', 'petropolis',
'santo antonio da patrulha', 'ponta grossa', 'santana de parnaiba',
'sao carlos', 'ibitinga', 'barueri', 'caxias do sul', 'araguari',
'contagem', 'jaragua do sul', 'lages - sc', 'bento goncalves',
'catanduva', 'ribeirao pires', 'jaboticabal', 'echapora', 'cotia',
'rio do sul', 'sorocaba', 'pradopolis', 'itaborai', 'mirassol',
'birigui', 'assis', 'jaguariuna', 'araraquara', 'macae',
'rio claro', 'londrina', 'ribeirao preto', 'tres coracoes',
'nhandeara', 'orleans', 'cuiaba', 'formosa do oeste', 'santos',
'santa terezinha de goias', 'arvorezinha', 'guiricema', 'caruaru',
'franca', 'salvador', 'diadema', 'itaquaquecetuba',
'lencois paulista', 'carapicuiba', 'uruacu', 'itajai', 'loanda',
'maringa', 'ferraz de vasconcelos', 'limeira', 'claudio',
'niteroi', 'osvaldo cruz', 'sao goncalo', 'jaciara',
'balenario camboriu', 'timbo', 'jacutinga', 'fortaleza',
'ferraz de vasconcelos', 'mirandopolis', 'bauru', 'jacarei',
'itu', 'laranjeiras do sul', 'videira', 'florianopolis',
'itapecerica da serra', 'mamanguape', 'ponte nova', 'goioere',
'pederneiras', 'itapevi', 'goiania', 'campina grande',
'estancia velha', 'resende', 'maua', 'caratinga', 'auriflama/sp',
'cafelandia', 'uba', 'sao paulo / sao paulo',
'sao jose do rio preto', 'porto ferreira', 'tres coroas',
'blumenau', 'mogi das cruzes', 'jaci', 'laranjal paulista',
'americana', 'sertanopolis', 'apucarana', 'colombo',
'vicente de carvalho', 'mesquita', 'sao pauo', 'cascavel',
'fazenda rio grande', 'taboao da serra', 'sao jose dos campos',
'toledo', 'marechal candido rondon', 'jundiai', 'mandirituba',
'suzano', 'vespasiano', 'santa rosa', 'sao joaquim da barra',
'santo antonio de posse', 'uruguaiana', 'campanha', 'piracanjuba',
'concordia', 'santa rita do sapucai', 'barretos', 'indaiatuba',
'nilopolis', 'pompeia', 'barro alto', 'são paulo', 'praia grande',
'luiz alves', 'brusque', 'criciuma', 'jales', 'atibaia',
'rio branco', 'barra mansa', 'marilia', 'bahia', 'taubate',
'cascavael', 'monteiro lobato', 'viana', 'paraiba do sul',
'mococa', 'sao roque', 'passos', 'francisco beltrao', 'tocantins',
'porto belo', 'nova iguacu', 'icara', 'lajeado', 'horizontina',
'votorantim', 'campo bom', 'monte alto', 'fernandopolis',
'pedreira', 'poa', 'divinopolis', 'santa barbara d´oeste',
'canoas', 'mombuca', 'sete lagoas', 'campo do meio',
'cordeiropolis', 'uberlandia', 'santa barbara d oeste',
'volta redonda', '04482255', 'aracatuba', 'monte siao', 'garuva',
'bonfinopolis de minas', 'cosmopolis', 'pocos de caldas',
'artur nogueira', 'joao pessoa', 'dois corregos', 'araquari',
'novo hamburgo, rio grande do sul, brasil', 'floranopolis',
'sumare', 'guaira', 'cachoeiro de itapemirim', 'serrana',
'rolandia', 'congonhas', 'sao jose dos pinhais', 'boituva',
'mairipora', 'guaimbe', 'parai', 'aperibe', 'jaguaruna',
'vila velha', 'juiz de fora', 'fronteira', 'novo horizonte',
'pilar do sul', 'itajobi', 'cariacica / es', 'prados', 'mucambo',
'montes claros', 'vicosa', 'porto velho', 'sao jose do rio pardo',
'pato branco', 'sao joao del rei', 'presidente prudente',
'paracambi', 'serra negra', 'sao caetano do sul', 'bom jardim',
'serra redonda', 'sao francisco do sul', 'betim', 'imbituva',
'guaratuba', 'teresina', "sao miguel d'oeste", 'california',
'japira', 'foz do iguacu', 'nova friburgo', 'itau de minas',
'oliveira', 'sabara', 'pedrinhas paulista', 'votuporanga',
'holambra', 'ararangua', 'pinhais', 'pato bragado', 'carazinho',
'arinos', 'sao pedro', 'lages', 'ampere', 'itauna', 'mogi mirim',
'curitibanos', 'brasilia df', 'mogi das cruses', 'hortolandia',
'ipatinga', 'laguna', 'dores de campos', 'sao paulo - sp',
'araras', 'divisa nova', 'igaracu do tiete', 'pitangueiras',
'campo grande', 'garca', 'presidente epitacio', 'sbc/sp',
"arraial d'ajuda (porto seguro)", 'pacatuba', 'formosa',
'borda da mata', 'ubatuba', 'entre rios do oeste', 'formiga',
'venancio aires', 'navegantes', 'cruzeiro', 'santa maria',
'muriae', 'santo andre/sao paulo', 'ipe', 'messias targino',
'varginha', 'botucatu', 'domingos martins', 'uberaba',
'coronel fabriciano', 'cachoeirinha', 's jose do rio preto',
'taruma', 'pirassununga', 'aruja', 'sp / sp', 'angra dos reis',
'juzeiro do norte', 'laurentino', 'flores da cunha', 'montenegro',
'pedregulho', 'novo hamburgo', 'torres', 'aracaju',
'santa catarina', 'joao pinheiro', 'bady bassitt', 'sinop',
'guarapuava', 'araucaria', 'vitoria', 'batatais', 'lagoa santa',
'chapeco', 'umuarama', 'belford roxo', 'cariacica',
'monte alegre do sul', 'sp', 'lagoa da prata', 'rolante',
'teresopolis', 'itaporanga', 'campo largo', 'sao joao de meriti',
'maua/sao paulo', 'bom jesus dos perdoes', 'brotas', 'irece',
'coxim', 'jau', 'conselheiro lafaiete', 'amparo',
'sao miguel do oeste', 'gaspar', 'rio bonito', 'mandaguari',
'vargem grande paulista', 'conchal', 'cambe', 'marialva',
'alfenas', 'balneario camboriu', 'palhoca', 'sao bernardo do capo',
'guara', 'colatina', 'franco da rocha', 'lambari',
'mogi das cruzes / sp', 'treze tilias',
'rio de janeiro \\rio de janeiro', 'paulo lopes', 'santa cecilia',
'braco do norte', 'floresta', 'farroupilha', 'castro', 'luziania',
'joao monlevade', 'pelotas', 'sao bento', 'campos dos goytacazes',
'ouro fino', 'sao jose dos pinhas', 'tiete', 'viamao', 'janauba',
'capivari', 'santa terezinha de itaipu', 'igrejinha',
'sao bento do sul', 'duque de caxias', 'araxa', 'canoinhas',
'recife', 'barbacena/ minas gerais', 'vera cruz', 'parnamirim',
'santo angelo', 'paincandu', 'tres rios', 'tanabi',
'portoferreira', 'itatiba', 'sarandi', 'cravinhos', 'morrinhos',
'bebedouro', 'almirante tamandare', 'bertioga', 'natal',
'belo horizont', 'ivoti', 'andira-pr', 'cerqueira cesar',
'marapoama', 'imigrante', 'mairinque', 'sao paulo sp',
'rio de janeiro / rio de janeiro', 'andradas', 'sando andre',
'nova odessa', 'paulinia', 'extrema', 'olimpia',
'angra dos reis rj', 'ronda alta', 'sao paulo', 'sao vicente',
'pinhais/pr', 'portao', 'registro', 'ao bernardo do campo',
'carmo do cajuru', 'embu das artes', 'fernando prestes',
'castro pires', 'vargem grande do sul', 'campina das missoes',
'sao pedro da aldeia', 'miguelopolis', 'itapui', 'sbc', 'neopolis',
'mineiros do tiete', 'varzea paulista', 'nova lima', 'barbacena',
'caieiras', 'buritama', 'erechim', 'itapetininga', 'pinhalzinho',
'descalvado', 'pitanga', 'bage', 'taio', "santa barbara d'oeste",
'patos de minas', 'garulhos', 'jarinu', 'nova petropolis',
'ribeirao preto / sao paulo', 'camboriu', 'nova trento',
'sao luis', 'sao jose do rio pret', 'eusebio', 'itaipulandia',
'ipira', 'campo magro', 'tiradentes', 'sao paluo', 'baependi',
'embu guacu', 'paraiso do sul', 'aparecida', 'cataguases',
'bariri', 'abadia de goias', 'alambari', 'ji parana', 'vassouras',
'lorena', 'rodeio', 'louveira', 'guanhaes',
'santo antonio de padua', 'presidente getulio', 'campos novos',
'eunapolis', 'engenheiro coelho', 'rio das pedras',
'afonso claudio', 'carapicuiba / sao paulo', 'centro', 'parana',
'indaial', 'bombinhas', 'orlandia', 'itapeva', 'sao sebastiao',
'macatuba', 'sao joao da boa vista', 'teixeira soares',
'mandaguacu', 'rio do oeste', 'vendas@creditparts.com.br',
'armacao dos buzios', 'mateus leme', 'sao paulop',
'campo limpo paulista', 'socorro', 'serra', 'bocaiuva do sul',
'ilheus', 'imbe', 'soledade', 'cajamar', 'rio negrinho',
'clementina', 'francisco morato', 'rio grande', 'xaxim', 'manaus',
'terra boa', 'minas gerais', 'avare', 'ibirite',
'santa maria da serra', 'auriflama', 'condor', 'ibia', 'guanambi',
'caucaia', 'cordilheira alta', 'carmo da mata', 'ouro preto',
'pedro leopoldo', 'santa rosa de viterbo', 'xanxere',
'alvares machado', 'scao jose do rio pardo', 'ribeirao das neves',
'medianeira', 'massaranduba', 'cornelio procopio', 'pirituba',
'jambeiro', 'sao leopoldo', 'aguas claras df', 'ribeirao pretp',
'cianorte', 'feira de santana', 'cachoeira do sul', 'guariba',
'sao sebastiao da grama/sp', 'dracena', 'ourinhos',
'robeirao preto', 'cacador', 'gama', 'queimados', 'cananeia',
'presidente bernardes', 'pinhalao', 'sombrio', 'campo mourao',
'ilicinea', 'itabira', 'barrinha', 'jussara', 'uniao da vitoria',
'triunfo', 'santa cruz do sul', 'colorado', 'itapema', 'sapiranga',
'paranavai', 'alvorada', 'ipaussu', 'rio verde', 'mage',
'tabao da serra', 'bofete', 'picarras', 'marica', 'jaragua',
'governador valadares', 'rio de janeiro, rio de janeiro, brasil',
'pouso alegre', 'timoteo', 'muqui', 'ipua', 'jacarei / sao paulo',
'varzea alegre', 'guaratingueta', 'tambau', 'irati',
'riberao preto', 'aparecida de goiania', 'bandeirantes',
'vitoria de santo antao', 'palotina', 'leme'], dtype=object)
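The `seller_city` values above contain several spellings of the same city ('sao paulo', 'são paulo', 'sao paulo - sp', 'sao paulo / sao paulo') as well as entries that are not city names at all ('04482255', 'vendas@creditparts.com.br'). The sketch below shows one possible first cleaning pass using only the standard library. `normalize_city` is a hypothetical helper, and its rules (strip accents, keep the text before the first separator, reject numeric and e-mail values) are assumptions; it will not repair typos such as 'sao pauo' or unspaced suffixes such as 'andira-pr'.

```python
import re
import unicodedata


def normalize_city(raw):
    """Return a simplified city name, or None if the value is not a city name."""
    text = unicodedata.normalize("NFKD", raw)
    # Drop combining marks so that 'são paulo' becomes 'sao paulo'.
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower().strip()
    # Keep only the part before the first '/', '\', ',' or ' - ' separator.
    text = re.split(r"\s*[/\\,]\s*|\s+-\s+", text)[0]
    # Collapse repeated whitespace.
    text = re.sub(r"\s+", " ", text).strip()
    # Reject values that are purely numeric or look like e-mail addresses.
    if not text or text.isdigit() or "@" in text:
        return None
    return text
```

Applied to the table, this could look like `sellers_df['seller_city'].map(normalize_city)`, after which `unique()` should return a shorter list to review by hand.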
sellers_df.count()
seller_id 3095
seller_zip_code_prefix 3095
seller_city 3095
seller_state 3095
dtype: int64
sellers_df.isna().sum()
seller_id 0
seller_zip_code_prefix 0
seller_city 0
seller_state 0
dtype: int64
- review_df
review_df.head()
review_df.describe()
review_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99224 entries, 0 to 99223
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 review_id 99224 non-null object
1 order_id 99224 non-null object
2 review_score 99224 non-null int64
3 review_comment_title 11568 non-null object
4 review_comment_message 40977 non-null object
5 review_creation_date 99224 non-null object
6 review_answer_timestamp 99224 non-null object
dtypes: int64(1), object(6)
memory usage: 5.3+ MB
review_df['review_score'].unique()
array([4, 5, 1, 3, 2])
review_df.count()
review_id 99224
order_id 99224
review_score 99224
review_comment_title 11568
review_comment_message 40977
review_creation_date 99224
review_answer_timestamp 99224
dtype: int64
review_df.isna().sum()
review_id 0
order_id 0
review_score 0
review_comment_title 87656
review_comment_message 58247
review_creation_date 0
review_answer_timestamp 0
dtype: int64
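In `review_df`, most rows have no comment title and more than half have no comment message, so the raw counts from `isna().sum()` are easier to judge as percentages. The sketch below is one way to do that; `missing_summary` is a hypothetical helper, and the toy frame only stands in for `review_df` with made-up values.

```python
import pandas as pd


def missing_summary(df):
    """Return per-column missing counts and percentages, most-missing first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "missing_pct": (counts / len(df) * 100).round(2),
    })
    return summary.sort_values("missing", ascending=False)


# Toy frame standing in for review_df (values are made up for illustration).
toy_reviews = pd.DataFrame({
    "review_score": [5, 4, 1, 3],
    "review_comment_title": [None, "ok", None, None],
    "review_comment_message": ["great", None, None, "late"],
})
summary = missing_summary(toy_reviews)
print(summary)
```

The same call, `missing_summary(review_df)`, should work unchanged on any of the tables in this post.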
- items_df
items_df.head()
items_df.describe()
items_df.select_dtypes('number').corr()
items_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112650 entries, 0 to 112649
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 order_id 112650 non-null object
1 order_item_id 112650 non-null int64
2 product_id 112650 non-null object
3 seller_id 112650 non-null object
4 shipping_limit_date 112650 non-null object
5 price 112650 non-null float64
6 freight_value 112650 non-null float64
dtypes: float64(2), int64(1), object(4)
memory usage: 6.0+ MB
items_df['shipping_limit_date'].unique()
array(['2017-09-19 09:45:35', '2017-05-03 11:05:13',
'2018-01-18 14:48:30', ..., '2017-10-30 17:14:25',
'2017-08-21 00:04:32', '2018-06-12 17:10:13'], dtype=object)
items_df['price'].unique()
array([ 58.9 , 239.9 , 199. , ..., 7.84, 399.85, 736. ])
items_df.count()
order_id 112650
order_item_id 112650
product_id 112650
seller_id 112650
shipping_limit_date 112650
price 112650
freight_value 112650
dtype: int64
items_df.isna().sum()
order_id 0
order_item_id 0
product_id 0
seller_id 0
shipping_limit_date 0
price 0
freight_value 0
dtype: int64
- products_df
products_df.head()
products_df.describe()
products_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32951 entries, 0 to 32950
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 product_id 32951 non-null object
1 product_category_name 32341 non-null object
2 product_name_lenght 32341 non-null float64
3 product_description_lenght 32341 non-null float64
4 product_photos_qty 32341 non-null float64
5 product_weight_g 32949 non-null float64
6 product_length_cm 32949 non-null float64
7 product_height_cm 32949 non-null float64
8 product_width_cm 32949 non-null float64
dtypes: float64(7), object(2)
memory usage: 2.3+ MB
products_df['product_category_name'].unique()
array(['perfumaria', 'artes', 'esporte_lazer', 'bebes',
'utilidades_domesticas', 'instrumentos_musicais', 'cool_stuff',
'moveis_decoracao', 'eletrodomesticos', 'brinquedos',
'cama_mesa_banho', 'construcao_ferramentas_seguranca',
'informatica_acessorios', 'beleza_saude', 'malas_acessorios',
'ferramentas_jardim', 'moveis_escritorio', 'automotivo',
'eletronicos', 'fashion_calcados', 'telefonia', 'papelaria',
'fashion_bolsas_e_acessorios', 'pcs', 'casa_construcao',
'relogios_presentes', 'construcao_ferramentas_construcao',
'pet_shop', 'eletroportateis', 'agro_industria_e_comercio', nan,
'moveis_sala', 'sinalizacao_e_seguranca', 'climatizacao',
'consoles_games', 'livros_interesse_geral',
'construcao_ferramentas_ferramentas',
'fashion_underwear_e_moda_praia', 'fashion_roupa_masculina',
'moveis_cozinha_area_de_servico_jantar_e_jardim',
'industria_comercio_e_negocios', 'telefonia_fixa',
'construcao_ferramentas_iluminacao', 'livros_tecnicos',
'eletrodomesticos_2', 'artigos_de_festas', 'bebidas',
'market_place', 'la_cuisine', 'construcao_ferramentas_jardim',
'fashion_roupa_feminina', 'casa_conforto', 'audio',
'alimentos_bebidas', 'musica', 'alimentos',
'tablets_impressao_imagem', 'livros_importados',
'portateis_casa_forno_e_cafe', 'fashion_esporte',
'artigos_de_natal', 'fashion_roupa_infanto_juvenil',
'dvds_blu_ray', 'artes_e_artesanato', 'pc_gamer', 'moveis_quarto',
'cine_foto', 'fraldas_higiene', 'flores', 'casa_conforto_2',
'portateis_cozinha_e_preparadores_de_alimentos',
'seguros_e_servicos', 'moveis_colchao_e_estofado',
'cds_dvds_musicais'], dtype=object)
products_df.count()
product_id 32951
product_category_name 32341
product_name_lenght 32341
product_description_lenght 32341
product_photos_qty 32341
product_weight_g 32949
product_length_cm 32949
product_height_cm 32949
product_width_cm 32949
dtype: int64
products_df.isna().sum()
product_id 0
product_category_name 610
product_name_lenght 610
product_description_lenght 610
product_photos_qty 610
product_weight_g 2
product_length_cm 2
product_height_cm 2
product_width_cm 2
dtype: int64
products_df['product_category_name'].value_counts()
cama_mesa_banho 3029
esporte_lazer 2867
moveis_decoracao 2657
beleza_saude 2444
utilidades_domesticas 2335
...
fashion_roupa_infanto_juvenil 5
casa_conforto_2 5
pc_gamer 3
seguros_e_servicos 2
cds_dvds_musicais 1
Name: product_category_name, Length: 73, dtype: int64
sns.countplot(x='product_category_name', data=products_df)
plt.show()
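With 73 categories on the x-axis, the tick labels of the count plot above are likely to overlap. One option is to keep only the most frequent categories and pool the rest before plotting. The sketch below assumes that approach; `top_n_with_other` and the label "other" are hypothetical, and the toy Series uses made-up counts.

```python
import pandas as pd


def top_n_with_other(series, n=10, other_label="other"):
    """Count values, keep the n most frequent, and pool the rest into one bucket."""
    counts = series.value_counts()  # NaN is excluded by default
    top = counts.head(n)
    rest = counts.iloc[n:].sum()
    if rest > 0:
        top = pd.concat([top, pd.Series({other_label: rest})])
    return top


# Toy stand-in for products_df['product_category_name'] (counts are made up).
toy_categories = pd.Series(
    ["cama_mesa_banho"] * 4 + ["esporte_lazer"] * 3 + ["pc_gamer", "flores", None]
)
top = top_n_with_other(toy_categories, n=2)
```

The resulting Series could then be drawn as a horizontal bar chart, for example with `top.plot.barh()`, which leaves room for the long category names.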
- geolocation_df
geolocation_df.head()
geolocation_df.describe()
geolocation_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000163 entries, 0 to 1000162
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 geolocation_zip_code_prefix 1000163 non-null int64
1 geolocation_lat 1000163 non-null float64
2 geolocation_lng 1000163 non-null float64
3 geolocation_city 1000163 non-null object
4 geolocation_state 1000163 non-null object
dtypes: float64(2), int64(1), object(2)
memory usage: 38.2+ MB
geolocation_df['geolocation_city'].unique()
array(['sao paulo', 'são paulo', 'sao bernardo do campo', ..., 'ciríaco',
'estação', 'vila lângaro'], dtype=object)
geolocation_df.count()
geolocation_zip_code_prefix 1000163
geolocation_lat 1000163
geolocation_lng 1000163
geolocation_city 1000163
geolocation_state 1000163
dtype: int64
geolocation_df.isna().sum()
geolocation_zip_code_prefix 0
geolocation_lat 0
geolocation_lng 0
geolocation_city 0
geolocation_state 0
dtype: int64
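`geolocation_df` has 1,000,163 rows, so a zip code prefix most likely appears many times with slightly different coordinates. Before joining it to customers or sellers, it may help to reduce it to one row per prefix. The sketch below assumes that taking the mean latitude and longitude is acceptable; `zip_centroids` and the `n_points` column are hypothetical names, and the toy coordinates are made up.

```python
import pandas as pd


def zip_centroids(geo):
    """Collapse the geolocation table to one mean coordinate per zip prefix."""
    return (
        geo.groupby("geolocation_zip_code_prefix", as_index=False)
        .agg(
            geolocation_lat=("geolocation_lat", "mean"),
            geolocation_lng=("geolocation_lng", "mean"),
            n_points=("geolocation_lat", "size"),  # rows merged into this prefix
        )
    )


# Toy stand-in for geolocation_df (coordinates are made up for illustration).
toy_geo = pd.DataFrame({
    "geolocation_zip_code_prefix": [1037, 1037, 1046],
    "geolocation_lat": [-23.5, -23.7, -23.4],
    "geolocation_lng": [-46.6, -46.8, -46.5],
})
centroids = zip_centroids(toy_geo)
```

A median instead of a mean would be less sensitive to stray points; which one fits better depends on how noisy the coordinates turn out to be.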
- category_df
category_df.head()
category_df.describe()
category_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71 entries, 0 to 70
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 product_category_name 71 non-null object
1 product_category_name_english 71 non-null object
dtypes: object(2)
memory usage: 1.2+ KB
category_df['product_category_name'].unique()
array(['beleza_saude', 'informatica_acessorios', 'automotivo',
'cama_mesa_banho', 'moveis_decoracao', 'esporte_lazer',
'perfumaria', 'utilidades_domesticas', 'telefonia',
'relogios_presentes', 'alimentos_bebidas', 'bebes', 'papelaria',
'tablets_impressao_imagem', 'brinquedos', 'telefonia_fixa',
'ferramentas_jardim', 'fashion_bolsas_e_acessorios',
'eletroportateis', 'consoles_games', 'audio', 'fashion_calcados',
'cool_stuff', 'malas_acessorios', 'climatizacao',
'construcao_ferramentas_construcao',
'moveis_cozinha_area_de_servico_jantar_e_jardim',
'construcao_ferramentas_jardim', 'fashion_roupa_masculina',
'pet_shop', 'moveis_escritorio', 'market_place', 'eletronicos',
'eletrodomesticos', 'artigos_de_festas', 'casa_conforto',
'construcao_ferramentas_ferramentas', 'agro_industria_e_comercio',
'moveis_colchao_e_estofado', 'livros_tecnicos', 'casa_construcao',
'instrumentos_musicais', 'moveis_sala',
'construcao_ferramentas_iluminacao',
'industria_comercio_e_negocios', 'alimentos', 'artes',
'moveis_quarto', 'livros_interesse_geral',
'construcao_ferramentas_seguranca',
'fashion_underwear_e_moda_praia', 'fashion_esporte',
'sinalizacao_e_seguranca', 'pcs', 'artigos_de_natal',
'fashion_roupa_feminina', 'eletrodomesticos_2',
'livros_importados', 'bebidas', 'cine_foto', 'la_cuisine',
'musica', 'casa_conforto_2', 'portateis_casa_forno_e_cafe',
'cds_dvds_musicais', 'dvds_blu_ray', 'flores',
'artes_e_artesanato', 'fraldas_higiene',
'fashion_roupa_infanto_juvenil', 'seguros_e_servicos'],
dtype=object)
category_df['product_category_name_english'].unique()
array(['health_beauty', 'computers_accessories', 'auto', 'bed_bath_table',
'furniture_decor', 'sports_leisure', 'perfumery', 'housewares',
'telephony', 'watches_gifts', 'food_drink', 'baby', 'stationery',
'tablets_printing_image', 'toys', 'fixed_telephony',
'garden_tools', 'fashion_bags_accessories', 'small_appliances',
'consoles_games', 'audio', 'fashion_shoes', 'cool_stuff',
'luggage_accessories', 'air_conditioning',
'construction_tools_construction',
'kitchen_dining_laundry_garden_furniture',
'costruction_tools_garden', 'fashion_male_clothing', 'pet_shop',
'office_furniture', 'market_place', 'electronics',
'home_appliances', 'party_supplies', 'home_confort',
'costruction_tools_tools', 'agro_industry_and_commerce',
'furniture_mattress_and_upholstery', 'books_technical',
'home_construction', 'musical_instruments',
'furniture_living_room', 'construction_tools_lights',
'industry_commerce_and_business', 'food', 'art',
'furniture_bedroom', 'books_general_interest',
'construction_tools_safety', 'fashion_underwear_beach',
'fashion_sport', 'signaling_and_security', 'computers',
'christmas_supplies', 'fashio_female_clothing',
'home_appliances_2', 'books_imported', 'drinks', 'cine_photo',
'la_cuisine', 'music', 'home_comfort_2',
'small_appliances_home_oven_and_coffee', 'cds_dvds_musicals',
'dvds_blu_ray', 'flowers', 'arts_and_craftmanship',
'diapers_and_hygiene', 'fashion_childrens_clothes',
'security_and_services'], dtype=object)
category_df.count()
product_category_name 71
product_category_name_english 71
dtype: int64
category_df.isna().sum()
product_category_name 0
product_category_name_english 0
dtype: int64
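`products_df` lists 73 category names while `category_df` translates only 71, so some products will have no English name after a join. For example, `pc_gamer` appears in the products listing but not in the translation listing. The sketch below shows one way to attach the translations and find the gaps with a left merge and `indicator=True`; the toy frames are stand-ins built from a few names seen above.

```python
import pandas as pd

# Toy stand-ins for products_df and category_df.
toy_products = pd.DataFrame({
    "product_id": ["p1", "p2", "p3"],
    "product_category_name": ["perfumaria", "pc_gamer", None],
})
toy_category = pd.DataFrame({
    "product_category_name": ["perfumaria"],
    "product_category_name_english": ["perfumery"],
})

# Left merge keeps every product; _merge records whether a translation was found.
merged = toy_products.merge(
    toy_category, on="product_category_name", how="left", indicator=True
)

# Categories present in the products table but absent from the translation table.
untranslated = (
    merged.loc[merged["_merge"] == "left_only", "product_category_name"]
    .dropna()
    .unique()
)
print(untranslated)
```

Products whose category is missing altogether also come out as `left_only`, which is why the sketch drops NaN before listing the untranslated names.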
- orders_df
orders_df.head()
orders_df.describe()
orders_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99441 entries, 0 to 99440
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 order_id 99441 non-null object
1 customer_id 99441 non-null object
2 order_status 99441 non-null object
3 order_purchase_timestamp 99441 non-null object
4 order_approved_at 99281 non-null object
5 order_delivered_carrier_date 97658 non-null object
6 order_delivered_customer_date 96476 non-null object
7 order_estimated_delivery_date 99441 non-null object
dtypes: object(8)
memory usage: 6.1+ MB
orders_df['order_delivered_carrier_date'].unique()
array(['2017-10-04 19:55:00', '2018-07-26 14:31:00',
'2018-08-08 13:50:00', ..., '2017-08-28 20:52:26',
'2018-01-12 15:35:03', '2018-03-09 22:11:59'], dtype=object)
orders_df['order_delivered_customer_date'].unique()
array(['2017-10-10 21:25:13', '2018-08-07 15:27:45',
'2018-08-17 18:06:29', ..., '2017-09-21 11:24:17',
'2018-01-25 23:32:54', '2018-03-16 13:08:30'], dtype=object)
orders_df['order_estimated_delivery_date'].unique()
array(['2017-10-18 00:00:00', '2018-08-13 00:00:00',
'2018-09-04 00:00:00', '2017-12-15 00:00:00',
'2018-02-26 00:00:00', '2017-08-01 00:00:00',
'2017-05-09 00:00:00', '2017-06-07 00:00:00',
'2017-03-06 00:00:00', '2017-08-23 00:00:00',
'2017-08-08 00:00:00', '2018-07-18 00:00:00',
'2018-08-08 00:00:00', '2018-03-21 00:00:00',
'2018-07-04 00:00:00', '2018-02-06 00:00:00',
'2018-01-29 00:00:00', '2017-12-11 00:00:00',
'2017-11-23 00:00:00', '2017-09-28 00:00:00',
'2018-03-29 00:00:00', '2018-02-21 00:00:00',
'2018-08-17 00:00:00', '2018-03-12 00:00:00',
'2018-03-28 00:00:00', '2018-05-23 00:00:00',
'2018-04-13 00:00:00', '2018-05-15 00:00:00',
'2018-01-08 00:00:00', '2018-03-07 00:00:00',
'2018-08-06 00:00:00', '2018-03-20 00:00:00',
'2017-08-22 00:00:00', '2018-07-17 00:00:00',
'2018-04-12 00:00:00', '2017-06-12 00:00:00',
'2017-12-21 00:00:00', '2017-09-01 00:00:00',
'2018-09-13 00:00:00', '2018-06-28 00:00:00',
'2017-06-09 00:00:00', '2018-05-25 00:00:00',
'2017-08-31 00:00:00', '2018-02-23 00:00:00',
'2018-07-20 00:00:00', '2018-08-16 00:00:00',
'2018-01-16 00:00:00', '2017-09-20 00:00:00',
'2018-07-16 00:00:00', '2018-07-05 00:00:00',
'2018-04-02 00:00:00', '2017-03-30 00:00:00',
'2017-07-06 00:00:00', '2017-12-18 00:00:00',
'2018-08-15 00:00:00', '2017-12-05 00:00:00',
'2018-03-13 00:00:00', '2018-02-14 00:00:00',
'2018-07-13 00:00:00', '2018-06-26 00:00:00',
'2018-08-02 00:00:00', '2017-09-25 00:00:00',
'2018-05-08 00:00:00', '2017-03-21 00:00:00',
'2017-05-12 00:00:00', '2017-10-11 00:00:00',
'2018-08-30 00:00:00', '2017-08-16 00:00:00',
'2018-01-19 00:00:00', '2017-04-27 00:00:00',
'2017-06-01 00:00:00', '2017-05-25 00:00:00',
'2017-11-21 00:00:00', '2018-01-03 00:00:00',
'2017-09-21 00:00:00', '2018-06-05 00:00:00',
'2018-02-19 00:00:00', '2018-05-16 00:00:00',
'2017-10-13 00:00:00', '2018-05-21 00:00:00',
'2018-01-22 00:00:00', '2018-05-07 00:00:00',
'2018-08-27 00:00:00', '2018-06-08 00:00:00',
'2017-04-26 00:00:00', '2018-07-23 00:00:00',
'2017-06-06 00:00:00', '2018-08-21 00:00:00',
'2018-03-26 00:00:00', '2017-03-10 00:00:00',
'2017-07-25 00:00:00', '2017-10-16 00:00:00',
'2017-12-22 00:00:00', '2018-09-05 00:00:00',
'2018-08-10 00:00:00', '2018-05-29 00:00:00',
'2017-12-19 00:00:00', '2017-10-17 00:00:00',
'2017-07-10 00:00:00', '2018-05-04 00:00:00',
'2018-05-14 00:00:00', '2017-08-04 00:00:00',
'2017-10-03 00:00:00', '2017-12-14 00:00:00',
'2017-10-31 00:00:00', '2018-01-04 00:00:00',
'2018-04-20 00:00:00', '2018-03-08 00:00:00',
'2018-07-30 00:00:00', '2017-04-17 00:00:00',
'2017-07-28 00:00:00', '2018-06-04 00:00:00',
'2018-07-19 00:00:00', '2018-03-16 00:00:00',
'2018-01-31 00:00:00', '2017-05-29 00:00:00',
'2017-12-27 00:00:00', '2018-06-12 00:00:00',
'2017-12-20 00:00:00', '2018-03-09 00:00:00',
'2017-06-05 00:00:00', '2018-02-07 00:00:00',
'2017-06-08 00:00:00', '2017-08-11 00:00:00',
'2018-07-27 00:00:00', '2018-07-25 00:00:00',
'2017-04-11 00:00:00', '2017-11-09 00:00:00',
'2018-04-11 00:00:00', '2018-07-11 00:00:00',
'2017-09-27 00:00:00', '2018-04-26 00:00:00',
'2018-02-15 00:00:00', '2018-05-02 00:00:00',
'2017-10-20 00:00:00', '2017-05-15 00:00:00',
'2018-02-02 00:00:00', '2017-04-10 00:00:00',
'2018-08-23 00:00:00', '2017-07-18 00:00:00',
'2017-08-07 00:00:00', '2017-08-03 00:00:00',
'2017-07-14 00:00:00', '2018-06-06 00:00:00',
'2018-08-09 00:00:00', '2017-08-21 00:00:00',
'2018-07-31 00:00:00', '2017-03-28 00:00:00',
'2018-02-01 00:00:00', '2018-05-03 00:00:00',
'2017-06-16 00:00:00', '2017-12-26 00:00:00',
'2017-06-28 00:00:00', '2017-10-04 00:00:00',
'2018-05-11 00:00:00', '2017-10-27 00:00:00',
'2018-03-06 00:00:00', '2017-12-06 00:00:00',
'2017-06-26 00:00:00', '2018-04-19 00:00:00',
'2018-05-28 00:00:00', '2018-05-09 00:00:00',
'2017-05-11 00:00:00', '2017-12-13 00:00:00',
'2018-01-24 00:00:00', '2018-03-22 00:00:00',
'2018-04-24 00:00:00', '2017-02-13 00:00:00',
'2017-05-10 00:00:00', '2018-07-12 00:00:00',
'2018-04-27 00:00:00', '2017-03-16 00:00:00',
'2018-03-05 00:00:00', '2017-12-12 00:00:00',
'2018-02-08 00:00:00', '2017-03-17 00:00:00',
'2018-07-24 00:00:00', '2017-10-30 00:00:00',
'2018-02-22 00:00:00', '2018-05-30 00:00:00',
'2018-03-23 00:00:00', '2018-04-16 00:00:00',
'2018-05-24 00:00:00', '2018-04-05 00:00:00',
'2018-04-03 00:00:00', '2018-02-20 00:00:00',
'2017-11-27 00:00:00', '2018-03-01 00:00:00',
'2018-08-14 00:00:00', '2017-07-19 00:00:00',
'2018-04-17 00:00:00', '2018-08-03 00:00:00',
'2018-04-06 00:00:00', '2018-04-09 00:00:00',
'2017-03-02 00:00:00', '2017-10-23 00:00:00',
'2018-01-02 00:00:00', '2017-06-23 00:00:00',
'2018-01-30 00:00:00', '2017-09-13 00:00:00',
'2018-07-03 00:00:00', '2016-12-09 00:00:00',
'2017-08-17 00:00:00', '2018-01-05 00:00:00',
'2018-08-24 00:00:00', '2018-02-05 00:00:00',
'2018-05-18 00:00:00', '2018-07-26 00:00:00',
'2017-09-04 00:00:00', '2018-08-20 00:00:00',
'2018-09-21 00:00:00', '2018-03-19 00:00:00',
'2018-09-12 00:00:00', '2018-08-28 00:00:00',
'2017-11-08 00:00:00', '2017-05-19 00:00:00',
'2018-04-25 00:00:00', '2018-01-17 00:00:00',
'2017-11-07 00:00:00', '2017-11-14 00:00:00',
'2017-11-29 00:00:00', '2017-04-03 00:00:00',
'2017-07-11 00:00:00', '2017-06-29 00:00:00',
'2018-06-14 00:00:00', '2016-12-07 00:00:00',
'2017-04-25 00:00:00', '2017-11-17 00:00:00',
'2018-08-22 00:00:00', '2017-07-05 00:00:00',
'2017-05-18 00:00:00', '2017-12-07 00:00:00',
'2018-01-12 00:00:00', '2017-05-04 00:00:00',
'2017-11-06 00:00:00', '2017-09-18 00:00:00',
'2017-05-31 00:00:00', '2018-01-26 00:00:00',
'2018-01-23 00:00:00', '2017-11-03 00:00:00',
'2017-10-02 00:00:00', '2017-08-14 00:00:00',
'2018-09-18 00:00:00', '2017-07-04 00:00:00',
'2017-08-29 00:00:00', '2017-10-09 00:00:00',
'2018-04-04 00:00:00', '2017-12-08 00:00:00',
'2017-11-01 00:00:00', '2018-06-29 00:00:00',
'2017-06-27 00:00:00', '2018-03-27 00:00:00',
'2017-08-28 00:00:00', '2017-03-09 00:00:00',
'2017-05-05 00:00:00', '2017-03-24 00:00:00',
'2018-03-02 00:00:00', '2017-03-23 00:00:00',
'2017-02-20 00:00:00', '2017-06-14 00:00:00',
'2017-12-01 00:00:00', '2018-01-09 00:00:00',
'2018-04-23 00:00:00', '2017-07-31 00:00:00',
'2018-06-19 00:00:00', '2017-04-12 00:00:00',
'2018-08-31 00:00:00', '2017-07-17 00:00:00',
'2017-03-14 00:00:00', '2017-09-29 00:00:00',
'2018-05-22 00:00:00', '2017-10-10 00:00:00',
'2017-04-20 00:00:00', '2017-03-03 00:00:00',
'2017-12-29 00:00:00', '2018-06-07 00:00:00',
'2018-01-18 00:00:00', '2018-02-09 00:00:00',
'2017-05-16 00:00:00', '2017-09-08 00:00:00',
'2018-06-13 00:00:00', '2018-06-21 00:00:00',
'2018-02-16 00:00:00', '2017-08-25 00:00:00',
'2017-08-15 00:00:00', '2017-11-10 00:00:00',
'2018-06-20 00:00:00', '2018-06-01 00:00:00',
'2018-03-15 00:00:00', '2017-12-28 00:00:00',
'2017-03-29 00:00:00', '2017-11-22 00:00:00',
'2018-03-14 00:00:00', '2018-05-17 00:00:00',
'2017-10-19 00:00:00', '2018-08-01 00:00:00',
'2017-05-08 00:00:00', '2017-04-24 00:00:00',
'2017-03-27 00:00:00', '2017-07-12 00:00:00',
'2017-07-27 00:00:00', '2018-01-11 00:00:00',
'2017-06-02 00:00:00', '2017-05-30 00:00:00',
'2017-09-05 00:00:00', '2017-03-13 00:00:00',
'2017-12-04 00:00:00', '2016-11-29 00:00:00',
'2017-02-16 00:00:00', '2017-10-05 00:00:00',
'2017-08-02 00:00:00', '2017-11-16 00:00:00',
'2018-05-10 00:00:00', '2018-06-11 00:00:00',
'2017-06-30 00:00:00', '2018-04-10 00:00:00',
'2017-09-06 00:00:00', '2017-06-19 00:00:00',
'2017-11-13 00:00:00', '2017-04-13 00:00:00',
'2017-05-02 00:00:00', '2018-08-07 00:00:00',
'2018-01-15 00:00:00', '2017-04-04 00:00:00',
'2018-08-29 00:00:00', '2017-04-07 00:00:00',
'2018-06-18 00:00:00', '2017-08-10 00:00:00',
'2017-05-17 00:00:00', '2017-08-09 00:00:00',
'2017-09-26 00:00:00', '2017-08-30 00:00:00',
'2017-04-06 00:00:00', '2017-05-26 00:00:00',
'2018-09-20 00:00:00', '2017-07-13 00:00:00',
'2017-06-20 00:00:00', '2017-07-24 00:00:00',
'2017-04-05 00:00:00', '2018-09-17 00:00:00',
'2017-05-23 00:00:00', '2017-11-28 00:00:00',
'2017-08-18 00:00:00', '2017-05-03 00:00:00',
'2017-09-14 00:00:00', '2017-11-30 00:00:00',
'2017-07-20 00:00:00', '2017-05-24 00:00:00',
'2017-02-22 00:00:00', '2018-04-30 00:00:00',
'2017-10-25 00:00:00', '2017-07-03 00:00:00',
'2018-09-03 00:00:00', '2017-09-22 00:00:00',
'2017-03-07 00:00:00', '2017-06-21 00:00:00',
'2018-09-14 00:00:00', '2017-11-24 00:00:00',
'2017-03-22 00:00:00', '2017-02-17 00:00:00',
'2017-07-21 00:00:00', '2017-08-24 00:00:00',
'2017-07-07 00:00:00', '2017-09-11 00:00:00',
'2017-06-22 00:00:00', '2017-09-12 00:00:00',
'2017-06-13 00:00:00', '2017-10-24 00:00:00',
'2017-10-06 00:00:00', '2017-02-24 00:00:00',
'2016-11-30 00:00:00', '2017-03-01 00:00:00',
'2016-12-01 00:00:00', '2017-02-28 00:00:00',
'2017-04-19 00:00:00', '2018-09-11 00:00:00',
'2017-03-20 00:00:00', '2018-04-18 00:00:00',
'2018-01-10 00:00:00', '2018-10-17 00:00:00',
'2018-06-25 00:00:00', '2018-09-06 00:00:00',
'2018-09-10 00:00:00', '2017-09-15 00:00:00',
'2017-03-08 00:00:00', '2017-03-31 00:00:00',
'2017-10-26 00:00:00', '2017-04-18 00:00:00',
'2017-09-19 00:00:00', '2018-09-19 00:00:00',
'2018-09-25 00:00:00', '2018-06-27 00:00:00',
'2017-07-26 00:00:00', '2017-05-22 00:00:00',
'2016-11-16 00:00:00', '2017-02-15 00:00:00',
'2017-03-15 00:00:00', '2018-09-27 00:00:00',
'2016-11-25 00:00:00', '2016-10-28 00:00:00',
'2016-10-20 00:00:00', '2018-10-02 00:00:00',
'2016-12-05 00:00:00', '2017-02-01 00:00:00',
'2016-12-08 00:00:00', '2018-09-28 00:00:00',
'2016-11-28 00:00:00', '2018-06-22 00:00:00',
'2018-10-10 00:00:00', '2017-02-21 00:00:00',
'2017-04-28 00:00:00', '2017-02-23 00:00:00',
'2017-02-07 00:00:00', '2018-07-02 00:00:00',
'2017-02-27 00:00:00', '2016-11-23 00:00:00',
'2016-11-18 00:00:00', '2016-12-12 00:00:00',
'2016-12-14 00:00:00', '2018-10-01 00:00:00',
'2018-09-26 00:00:00', '2016-09-30 00:00:00',
'2016-12-02 00:00:00', '2018-10-15 00:00:00',
'2018-09-24 00:00:00', '2017-02-14 00:00:00',
'2016-11-24 00:00:00', '2018-10-03 00:00:00',
'2016-12-06 00:00:00', '2017-01-09 00:00:00',
'2018-10-04 00:00:00', '2017-04-14 00:00:00',
'2018-02-13 00:00:00', '2016-11-07 00:00:00',
'2016-11-14 00:00:00', '2018-10-05 00:00:00',
'2016-10-04 00:00:00', '2018-10-16 00:00:00',
'2016-12-13 00:00:00', '2018-10-11 00:00:00',
'2018-10-25 00:00:00', '2016-12-16 00:00:00',
'2016-11-17 00:00:00', '2016-12-20 00:00:00',
'2017-01-19 00:00:00', '2017-02-09 00:00:00',
'2016-10-24 00:00:00', '2016-12-30 00:00:00',
'2017-02-10 00:00:00', '2018-07-06 00:00:00',
'2018-10-30 00:00:00', '2018-10-23 00:00:00',
'2018-11-12 00:00:00', '2016-12-19 00:00:00',
'2016-12-23 00:00:00', '2017-01-11 00:00:00',
'2016-10-25 00:00:00', '2018-07-10 00:00:00',
'2016-10-27 00:00:00'], dtype=object)
orders_df.count()
order_id 99441
customer_id 99441
order_status 99441
order_purchase_timestamp 99441
order_approved_at 99281
order_delivered_carrier_date 97658
order_delivered_customer_date 96476
order_estimated_delivery_date 99441
dtype: int64
orders_df.isna().sum()
order_id 0
customer_id 0
order_status 0
order_purchase_timestamp 0
order_approved_at 160
order_delivered_carrier_date 1783
order_delivered_customer_date 2965
order_estimated_delivery_date 0
dtype: int64
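All date columns in `orders_df` are stored as `object` (strings), which is why `unique()` above returns text rather than timestamps. Converting them with `pd.to_datetime` makes date arithmetic possible. The sketch below is an illustration on a toy frame; the `delay_days` column is a hypothetical addition, and the purchase timestamps are made up.

```python
import pandas as pd

# Toy stand-in for orders_df; the second order has no delivery date yet.
toy_orders = pd.DataFrame({
    "order_id": ["o1", "o2"],
    "order_purchase_timestamp": ["2017-10-02 10:00:00", "2018-07-24 20:00:00"],
    "order_delivered_customer_date": ["2017-10-10 21:25:13", None],
    "order_estimated_delivery_date": ["2017-10-18 00:00:00", "2018-08-13 00:00:00"],
})

date_cols = [col for col in toy_orders.columns if col != "order_id"]
for col in date_cols:
    # Missing values become NaT instead of raising an error.
    toy_orders[col] = pd.to_datetime(toy_orders[col])

# Positive values mean the order arrived after the estimated date.
toy_orders["delay_days"] = (
    toy_orders["order_delivered_customer_date"]
    - toy_orders["order_estimated_delivery_date"]
).dt.total_seconds() / 86400
```

Orders without a delivery date keep a missing `delay_days`, so they drop out of averages by default rather than being counted as on time.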
- payments_df
payments_df.head()
payments_df.describe()
payments_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103886 entries, 0 to 103885
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 order_id 103886 non-null object
1 payment_sequential 103886 non-null int64
2 payment_type 103886 non-null object
3 payment_installments 103886 non-null int64
4 payment_value 103886 non-null float64
dtypes: float64(1), int64(2), object(2)
memory usage: 4.0+ MB
payments_df['payment_type'].unique()
array(['credit_card', 'boleto', 'voucher', 'debit_card', 'not_defined'],
dtype=object)
payments_df.count()
order_id 103886
payment_sequential 103886
payment_type 103886
payment_installments 103886
payment_value 103886
dtype: int64
payments_df.isna().sum()
order_id 0
payment_sequential 0
payment_type 0
payment_installments 0
payment_value 0
dtype: int64
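`payments_df` has 103,886 rows against 99,441 orders in `orders_df`, and the `payment_sequential` column suggests that one order can be paid in several parts. To join payments to orders one-to-one, the table could first be collapsed to one row per order. The sketch below assumes that summing `payment_value` gives the order total; the output column names and the toy values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for payments_df: order o1 is paid in two parts.
toy_payments = pd.DataFrame({
    "order_id": ["o1", "o1", "o2"],
    "payment_sequential": [1, 2, 1],
    "payment_type": ["credit_card", "voucher", "boleto"],
    "payment_value": [80.0, 20.0, 50.0],
})

# One row per order: total paid, number of payment rows, distinct payment types.
order_payments = (
    toy_payments.groupby("order_id", as_index=False)
    .agg(
        total_value=("payment_value", "sum"),
        n_payments=("payment_sequential", "count"),
        n_types=("payment_type", "nunique"),
    )
)
print(order_payments)
```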