Python Big Data Analysis Based on Data Science, Chapter 06: Python Crawling Using Libraries

1. Preparing for Static Web Page Crawling

1.1 Practicing BeautifulSoup (1)

from bs4 import BeautifulSoup
  • Write a sample HTML string for practice
html = '<h1 id="title">한빛출판네트워크</h1><div class="top"><ul class="menu"><li><a href="http://wwww.hanbit.co.kr/member/login.html" class="login">로그인</a></li></ul><ul class="brand"><li><a href="http://www.hanbit.co.kr/media/">한빛미디어</a></li><li><a href="http://www.hanbit.co.kr/academy/">한빛아카데미</a></li></ul></div>'
  • Create a BeautifulSoup object
soup = BeautifulSoup(html, 'html.parser')
  • Check the HTML content stored in the object
print(soup.prettify())
<h1 id="title">
 한빛출판네트워크
</h1>
<div class="top">
 <ul class="menu">
  <li>
   <a class="login" href="http://wwww.hanbit.co.kr/member/login.html">
    로그인
   </a>
  </li>
 </ul>
 <ul class="brand">
  <li>
   <a href="http://www.hanbit.co.kr/media/">
    한빛미디어
   </a>
  </li>
  <li>
   <a href="http://www.hanbit.co.kr/academy/">
    한빛아카데미
   </a>
  </li>
 </ul>
</div>
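  • As a side note, 'html.parser' is the parser bundled with Python; BeautifulSoup can also use a third-party parser backend. A minimal sketch, assuming lxml has been installed separately (pip install lxml):
soup_lxml = BeautifulSoup(html, 'lxml')  # same interface, different parser backend
print(soup_lxml.h1.string)               # 한빛출판네트워크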

1.2 Practicing BeautifulSoup (2)

  1. Tag parsing: accessing a tag name as an attribute parses only the first matching tag.
soup.h1
<h1 id="title">한빛출판네트워크</h1>

tag_h1 = soup.h1
tag_h1
<h1 id="title">한빛출판네트워크</h1>

tag_div = soup.div
tag_div

<div class="top"><ul class="menu"><li><a class="login" href="http://wwww.hanbit.co.kr/member/login.html">로그인</a></li></ul><ul class="brand"><li><a href="http://www.hanbit.co.kr/media/">한빛미디어</a></li><li><a href="http://www.hanbit.co.kr/academy/">한빛아카데미</a></li></ul></div>

tag_ul = soup.ul
tag_ul
<ul class="menu"><li><a class="login" href="http://wwww.hanbit.co.kr/member/login.html">로그인</a></li></ul>

tag_li = soup.li
tag_li
<li><a class="login" href="http://wwww.hanbit.co.kr/member/login.html">로그인</a></li>

tag_a = soup.a
tag_a
<a class="login" href="http://wwww.hanbit.co.kr/member/login.html">로그인</a>
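  • A parsed tag also knows its neighbors; a small sketch using standard bs4 navigation attributes (not part of the book's example):
print(tag_a.parent.name)  # 'li' : the tag that encloses the <a>
print(tag_li.a['href'])   # the href of the <a> inside the first <li>
print(tag_h1.string)      # 한빛출판네트워크 : the tag's text content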
  • find_all() parses every matching tag and builds a list.
tag_ul_all = soup.find_all("ul")
tag_ul_all
[<ul class="menu"><li><a class="login" href="http://wwww.hanbit.co.kr/member/login.html">로그인</a></li></ul>,
 <ul class="brand"><li><a href="http://www.hanbit.co.kr/media/">한빛미디어</a></li><li><a href="http://www.hanbit.co.kr/academy/">한빛아카데미</a></li></ul>]
 
tag_li_all = soup.find_all("li")
tag_li_all
[<li><a class="login" href="http://wwww.hanbit.co.kr/member/login.html">로그인</a></li>,
 <li><a href="http://www.hanbit.co.kr/media/">한빛미디어</a></li>,
 <li><a href="http://www.hanbit.co.kr/academy/">한빛아카데미</a></li>]
 
tag_a_all = soup.find_all("a")
tag_a_all
[<a class="login" href="http://wwww.hanbit.co.kr/member/login.html">로그인</a>,
 <a href="http://www.hanbit.co.kr/media/">한빛미디어</a>,
 <a href="http://www.hanbit.co.kr/academy/">한빛아카데미</a>]
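  • Once the <a> tags are in a list, their link targets are easy to collect; a minimal sketch using the list parsed above:
urls = [a['href'] for a in tag_a_all]
urls
['http://wwww.hanbit.co.kr/member/login.html',
 'http://www.hanbit.co.kr/media/',
 'http://www.hanbit.co.kr/academy/']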
  • Parsing can also be done with the following:
  • 1) attrs: a dictionary of attribute names and attribute values
  • 2) find(): parse a specific tag by its attributes
  • 3) select(): parse every tag matching a CSS selector into a list
  • tag#id_value selects by id attribute
  • tag.class_value selects by class attribute
tag_a.attrs
{'href': 'http://wwww.hanbit.co.kr/member/login.html', 'class': ['login']}

tag_a['href']
http://wwww.hanbit.co.kr/member/login.html

tag_a['class']
['login']

tag_ul_2 = soup.find('ul', attrs={'class':'brand'})
tag_ul_2
<ul class="brand"><li><a href="http://www.hanbit.co.kr/media/">한빛미디어</a></li><li><a href="http://www.hanbit.co.kr/academy/">한빛아카데미</a></li></ul>

title = soup.find(id="title")
title
<h1 id="title">한빛출판네트워크</h1>

title.string
한빛출판네트워크
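  • Note that .string returns None when a tag has more than one child, while .get_text() concatenates all nested text. A quick contrast on the <ul class="brand"> tag parsed above:
print(tag_ul_2.string)      # None (the <ul> has two <li> children)
print(tag_ul_2.get_text())  # 한빛미디어한빛아카데미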

li_list = soup.select("div>ul.brand>li")
li_list
[<li><a href="http://www.hanbit.co.kr/media/">한빛미디어</a></li>,
 <li><a href="http://www.hanbit.co.kr/academy/">한빛아카데미</a></li>]

for li in li_list:
    print(li.string)
한빛미디어
한빛아카데미
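  • The selector forms listed above can be tried directly; a short sketch:
print(soup.select("h1#title"))  # tag#id_value
print(soup.select("a.login"))   # tag.class_value
print(soup.select_one("ul.brand > li").string)  # select_one() returns only the first match: 한빛미디어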

Static Web Page Crawling in Practice

  • Crawling from the Python shell
from bs4 import BeautifulSoup
import urllib.request

result = []

for page in range(1, 59):  # the store locator had 58 result pages at the time
    Hollys_url = 'https://www.hollys.co.kr/store/korea/korStore.do?pageNo=%d&sido=&gugun=&store=' % page
    print(Hollys_url)
    html = urllib.request.urlopen(Hollys_url)
    soupHollys = BeautifulSoup(html, 'html.parser')
    tag_tbody = soupHollys.find('tbody')
    for store in tag_tbody.find_all('tr'):
        if len(store) <= 3:  # stop at rows that are not store entries
            break
        store_td = store.find_all('td')
        store_name = store_td[1].string     # store name
        store_sido = store_td[0].string     # region (sido / gu)
        store_address = store_td[3].string  # address
        store_phone = store_td[5].string    # phone number
        result.append([store_name, store_sido, store_address, store_phone])
https://www.hollys.co.kr/store/korea/korStore.do?pageNo=1&sido=&gugun=&store=
https://www.hollys.co.kr/store/korea/korStore.do?pageNo=2&sido=&gugun=&store=
(... the same line repeats for pageNo=3 through 57 ...)
https://www.hollys.co.kr/store/korea/korStore.do?pageNo=58&sido=&gugun=&store=

len(result)
567

result[0]
['부천심곡점', '경기 부천시 소사구', '경기도 부천시 장말로 376 심곡동 355-1', '032-612-7258']

result[565]
['신촌점', '서울 서대문구', '서울시 서대문구 연세로 34 ', '02-393-2004']

store_td
[<td class="noline center_t">경기 성남시 수정구</td>,
 <td class="center_t"><a href="#" onclick="javascript:storeView(11); return false;">성남점</a></td>,
 <td class="center_t tdp0">영업중</td>,
 <td class="center_t"><a href="#" onclick="javascript:storeView(11); return false;">경기도 성남시 수정구 수정로 175, 동일빌딩1층</a></td>,
 <td class="center_t">
 <img alt="흡연시설" src="https://www.hollys.co.kr/websrc/images/store/img_store_s04.gif" style="margin-right:1px"/>
 </td>,
 <td class="center_t">031-721-6369</td>]

store_td[1].string
성남점

store_td[0].string
경기 성남시 수정구

store_td[3].string
경기도 성남시 수정구 수정로 175, 동일빌딩1층

store_td[5].string
031-721-6369
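
  • Some sites reject requests carrying urllib's default User-Agent. If urlopen() starts failing, a hedged workaround is to send a browser-like header via the standard urllib.request.Request API (the header value below is only an example, not something the book prescribes):
import urllib.request
import urllib.error

req = urllib.request.Request(Hollys_url, headers={'User-Agent': 'Mozilla/5.0'})
try:
    html = urllib.request.urlopen(req)    # fetch with the custom header
except urllib.error.HTTPError as e:
    print('request failed:', e.code)      # fail gracefully instead of crashing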

Saving the Crawled Data

import pandas as pd
hollys_tbl = pd.DataFrame(result, columns = ('store', 'sido-gu', 'address', 'phone'))
hollys_tbl.to_csv("hollys.csv", encoding = "cp949", mode = "w", index = True)
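
  • A quick sanity check, assuming the file was just written: read the CSV back with the same encoding and confirm the row count matches len(result).
check = pd.read_csv("hollys.csv", encoding="cp949", index_col=0)
len(check)   # expected: 567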

2. Dynamic Web Page Crawling

  • Install Selenium and the Chromium driver (these commands assume a Google Colab environment)
!pip install selenium
!apt-get update
!apt install chromium-chromedriver
(Install output omitted: selenium 4.7.2 was already present, and chromium-chromedriver 108.0.5359.71 was downloaded and set up along with chromium-browser.)

from bs4 import BeautifulSoup
import urllib.request
import pandas as pd
import datetime

from selenium import webdriver
import time

#[CODE1]
def CoffeeBean_store(result):
    CoffeeBean_URL = "https://www.coffeebeankorea.com/store/store.asp"
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--headless')  # no display available, so run headless
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')
    wd = webdriver.Chrome('chromedriver', chrome_options=chrome_options)

    for i in range(1, 370):  # iterate once per store ID
        wd.get(CoffeeBean_URL)
        time.sleep(1)   # wait 1 second for the page to load
        try:
            wd.execute_script("storePop2(%d)" % i)  # open the store-detail popup
            time.sleep(1)   # wait 1 second for the script to run
            html = wd.page_source
            soupCB = BeautifulSoup(html, 'html.parser')
            store_name_h2 = soupCB.select("div.store_txt > h2")
            store_name = store_name_h2[0].string
            print(store_name)   # print the store name as progress output
            store_info = soupCB.select("div.store_txt > table.store_table > tbody > tr > td")
            store_address_list = list(store_info[2])
            store_address = store_address_list[0]
            store_phone = store_info[3].string
            result.append([store_name, store_address, store_phone])
        except Exception:
            continue  # skip store IDs whose popup fails to load
    return

#[CODE0]
def main():
    result = []
    print('CoffeeBean store crawling >>>>>>>>>>>>>>>')
    CoffeeBean_store(result)    #[CODE1]

    CB_tbl = pd.DataFrame(result, columns = ('store', 'address', 'phone'))
    CB_tbl.to_csv('./CoffeeBean.csv', encoding = 'cp949', mode = 'w', index = True)

if __name__ == '__main__':
    main()
양천향교역점
성균관대 자연과학캠퍼스점
마곡마커스빌딩점 
방이역4번출구점
IFC MALL 3호점
동탄능동DT점
수원인계DT점
인천공항 제2여객터미널 입구점
부산서면서전로점
분당차병원본관앞점
신용산역래미안점
선릉역3번출구세방빌딩점
광명아브뉴프랑점
새문안로점
문정동환인제약빌딩점
경희대점
파주DT점
창원시티세븐점
종로구청앞점
상수역점
안양시청앞점
하남신세계백화점 B1점
종로센트로폴리스점
잠실향군타워점
부산서면점
영종도운서역점
동대입구역점
해운대센텀호텔점
교대법원점
부천북부역사거리점
부산남포역점
광주상무역점
김포장기DT점
서초동점
남양주화도DT점
선정릉역점
화곡역점
부천스타필드시티점
광주문화전당역점
대구228공원앞점
삼성루첸타워점
수원AK플라자점
하나은행을지로본점
선릉KSA한국표준협회점
논현역6번출구앞점
이태원몬드리안호텔점
광주봉선점
KB국민은행노원역점
수서역원방빌딩점
송파헬리오시티점
청계천을지로점
서초중앙로점
일산동구청점
부산화명수정역점
용인역북점
부산중앙역점
천안신부동점
제주아라점
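
  • For reference, recent Selenium 4 releases removed the positional driver path and the chrome_options keyword used above. A minimal sketch of the current API, assuming chromedriver sits at /usr/bin/chromedriver (where the chromium-chromedriver package installs it):
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
# service= replaces the old positional path; options= replaces chrome_options=
wd = webdriver.Chrome(service=Service('/usr/bin/chromedriver'), options=options)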