IoT篳路藍縷: [Python] Web Scraping Study Notes

Reference book: Python網路爬蟲與資料視覺化應用實務 (旗標出版社, Flag Publishing)
Requests

import requests

requests.get parameters
r = requests.get(url, timeout=0.3)  # set a timeout in seconds
r = requests.get(url, cookies={"over18": "1"})  # send the over18 cookie to pass PTT's age check
nextPg = response.xpath(xpath)
url = "http://www.google.com"
timeout = 0.3
'maxResult' : 5,
headers=url_headers

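Exception handling:
RequestException  request error (base class of the others)
HTTPError  invalid HTTP response
ConnectionError  connection error
Timeout  request timed out
TooManyRedirects  exceeded the maximum number of redirects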
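
A minimal sketch of how the exception classes listed above might be caught around requests.get. The helper name fetch and the 3-second default timeout are assumptions made for illustration; they do not come from the book.

```python
import requests


def fetch(url, timeout=3.0):
    """Return the page text, or None if the request fails (hypothetical helper)."""
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()  # raises HTTPError for 4xx/5xx status codes
        return r.text
    except requests.exceptions.HTTPError as ex:
        print("Invalid HTTP response:", ex)
    except requests.exceptions.ConnectionError as ex:
        print("Connection error:", ex)
    except requests.exceptions.Timeout as ex:
        print("Timed out:", ex)
    except requests.exceptions.TooManyRedirects as ex:
        print("Too many redirects:", ex)
    except requests.exceptions.RequestException as ex:
        # Base class of all the exceptions above, so it is checked last
        print("Request error:", ex)
    return None
```

Calling fetch with a malformed URL (one without a scheme, for instance) ends up in the last branch and returns None instead of raising.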

Using bs4 (BeautifulSoup) together with requests

from bs4 import BeautifulSoup
import requests

r = requests.get("http://www.google.com")
soup = BeautifulSoup(r.text, "lxml")
print(soup)

To write soup to a file in a readable, indented form, call
prettify()
Example:
fp = open("xxx.txt", "w", encoding="utf8")
fp.write(soup.prettify())
print("writing")
fp.close()

============================================================
find()

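find() returns the first tag that matches the condition
find_all() searches the whole HTML document and returns every match

find_all() with restrictions =>
soup.find_all("tag", class_="class_name", limit=limit_num)
(accepted arguments: tag name, id, limit=)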
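
The difference between find(), find_all(), and the limit argument can be sketched as follows. The HTML snippet is made up for illustration, and the built-in "html.parser" is used here instead of "lxml" only to keep the example free of extra dependencies.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for this illustration
html_doc = """
<ul id="list">
  <li class="item">A</li>
  <li class="item">B</li>
  <li class="item">C</li>
</ul>
"""
soup = BeautifulSoup(html_doc, "html.parser")

first = soup.find("li")                                   # first match only
all_items = soup.find_all("li", class_="item")            # every match
first_two = soup.find_all("li", class_="item", limit=2)   # stop after 2 matches

print(first.string)
print([li.string for li in all_items])
print(len(first_two))
```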

============================================================
Ch4
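select()

BeautifulSoup provides select() and select_one()
select() returns a list of every element that matches the CSS selector
select_one() returns only the first match
Older BeautifulSoup releases only support nth-of-type in CSS selectors, so an "nth-..." selector copied from the browser (such as nth-child) must be rewritten as => nth-of-type
e.g. tag.select("div:nth-of-type(3)")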
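
A short sketch of select(), select_one(), and nth-of-type on a made-up HTML fragment (the fragment and variable names are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for this illustration
html_doc = """
<div id="box">
  <p>one</p>
  <p>two</p>
  <p>three</p>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

paragraphs = soup.select("#box p")           # list of all matches
first_p = soup.select_one("#box p")          # first match, or None if nothing matches
third_p = soup.select("p:nth-of-type(3)")    # the third <p> among its siblings

print(len(paragraphs), first_p.get_text(), third_p[0].get_text())
```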

============================================================
Ch5
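Traversing ancestors, siblings, and descendants

tag.contents and tag.children both return the child tags
Use for child in tag.descendants to get every descendant tag below it
for tag in tag_ui.children  # loop over the child tags
for tag in tag_ui.parents  # loop over the ancestor tags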

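next_sibling  # walk to the next sibling
previous_sibling  # walk to the previous sibling

find_previous_sibling()  # find the previous sibling (automatically skips NavigableString objects)
find_next_sibling()  # find the next sibling (automatically skips NavigableString objects)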

Traversing the previous and next elements
next_element  # the next element
previous_element  # the previous element
Usage
next_tag = tag.next_element
print(type(next_tag), next_tag.name)
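
The traversal attributes above can be tried on a small made-up document. This is only a sketch; the HTML and variable names are assumptions. It also shows why find_next_sibling() is often more convenient than next_sibling: the latter returns the whitespace string between tags.

```python
from bs4 import BeautifulSoup

# Made-up HTML; the newlines between <li> tags become NavigableString nodes
html_doc = "<ul id='menu'><li>a</li>\n<li>b</li>\n<li>c</li></ul>"
soup = BeautifulSoup(html_doc, "html.parser")
tag_ul = soup.find("ul")
first_li = tag_ul.find("li")

# Child tags only (text nodes have name None, so they are filtered out)
children = [c.name for c in tag_ul.children if c.name]

# Ancestors of the first <li>, from nearest to the document root
ancestors = [p.name for p in first_li.parents]

raw_next = first_li.next_sibling              # the "\n" string, not a tag
next_li = first_li.find_next_sibling("li")    # skips the string, returns <li>b</li>

print(children, ancestors, repr(raw_next), next_li.string)
```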

===========================================================
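Modifying
Modifications change only the Python object tree, not the web page's source
tag.string = "str_value"  # modify the text
del tag["attribute_name"]  # delete an attribute
new_tag = soup.new_tag("tag_name")  # add, step 1: create the tag (new_tag is a method of the soup object)
tag.append(new_tag)  # add, step 2: attach it
insert_before()  # insert before xxx
insert_after()  # insert after xxx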
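
A minimal sketch of these modification calls on a one-line made-up document. The tag names and attribute values are assumptions chosen for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p id="msg" class="old">hello</p>', "html.parser")
tag = soup.p

tag.string = "str_value"       # replace the text inside <p>
del tag["class"]               # delete the class attribute

new_tag = soup.new_tag("b")    # create an empty <b> tag from the soup object
new_tag.string = "bold"
tag.append(new_tag)            # <b>bold</b> becomes the last child of <p>

rule = soup.new_tag("hr")
tag.insert_before(rule)        # place the new <hr> right before <p>

print(soup)                    # only the in-memory tree has changed
```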

Ch5-4 Reading and writing CSV & JSON
Writing CSV
import csv
with open(csv_file, 'w+', newline="") as fp:
    writer = csv.writer(fp)
    writer.writerow(["Data1","Data2","Data3"])  # write the header row Data1~Data3 (the column titles seen in Excel)
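
A small round-trip sketch: write a header plus two rows, then read them back with csv.reader. The file location (a temporary directory) and the row values are assumptions for illustration.

```python
import csv
import os
import tempfile

# Hypothetical output path inside a temporary directory
csv_file = os.path.join(tempfile.mkdtemp(), "example.csv")

rows = [
    ["Data1", "Data2", "Data3"],   # header row
    ["1", "2", "3"],
    ["4", "5", "6"],
]

# newline="" stops the csv module from emitting blank lines between rows on Windows
with open(csv_file, "w", newline="", encoding="utf8") as fp:
    writer = csv.writer(fp)
    writer.writerows(rows)

with open(csv_file, "r", newline="", encoding="utf8") as fp:
    read_back = list(csv.reader(fp))

print(read_back)
```

Note that csv.reader returns every field as a string, which is why the sample rows are written as strings.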

JSON
j_str = json.dumps(json_data)  # convert the json_data dict to the string j_str
json_data2 = json.loads(j_str)  # convert the string j_str back to the dict json_data2

j_file = "jfile.json"
with open(j_file, 'w') as fp:  # open j_file for writing
    json.dump(data, fp)  # write data to fp
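
The four json functions pair up as dumps/loads (strings) and dump/load (files). A minimal sketch with made-up sample data and a temporary file path, both assumptions for illustration:

```python
import json
import os
import tempfile

json_data = {"name": "Tom", "scores": [90, 85]}   # made-up sample data

j_str = json.dumps(json_data)     # dict -> str
json_data2 = json.loads(j_str)    # str -> dict

j_file = os.path.join(tempfile.mkdtemp(), "jfile.json")
with open(j_file, "w", encoding="utf8") as fp:
    json.dump(json_data, fp)      # dict -> file

with open(j_file, "r", encoding="utf8") as fp:
    loaded = json.load(fp)        # file -> dict

print(type(j_str).__name__, json_data2 == json_data, loaded == json_data)
```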

Ch5-5 Downloading images
response = requests.get(url, stream=True)
with open(path, 'wb') as fp:
    for chunk in response:
        fp.write(chunk)

Alternatively, open the download with urllib.request.urlopen(url)  # considered for efficiency

===========================================================
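Ch6 Using XPath and lxml
from lxml import html

tree = html.fromstring(r.text)  # assumption: r is the requests response for the page

In the browser developer tools, find the tag that holds the resource URL, left-click the three dots next to it, and choose Copy... XPath, as shown in the figure below:

tag_img = tree.xpath("/html/body/section[2]/table/tbody/tr[21]/td[1]/a/img")[0]
print(tag_img.attrib["src"])

=> https://www.flag.com.tw/assets/img/bookpic/F4473.jpg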

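6-4 Basic XPath syntax: axis::node-test[predicate]
* : wildcard, matches any element node (@* matches any attribute node)
Axis: the starting point (see the axis definitions for details)
Node test: which nodes on this axis match
Predicates: further filter conditions
Using the book's example /child::library/child::book[2]/child::author
/child::library  the child node of the root named library
/child::book[2]  the second child node named book
/child::author  the child node named author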
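
The path above can be tried with lxml on a small XML document. The XML below is made up for illustration (the book's own sample file is not reproduced in these notes); the sketch also shows the abbreviated form, in which the child:: axis is implied.

```python
from lxml import etree

# Made-up XML standing in for the book's library example
xml_doc = """
<library>
  <book><title>Book A</title><author>Author A</author></book>
  <book><title>Book B</title><author>Author B</author></book>
</library>
"""
root = etree.fromstring(xml_doc)

# Full syntax: axis::node-test[predicate]
full = root.xpath("/child::library/child::book[2]/child::author")

# Abbreviated syntax: child:: is the default axis
short = root.xpath("/library/book[2]/author")

print(full[0].text, short[0].text)
```

Both expressions select the author of the second book; XPath positions start at 1, not 0.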

With XPath Helper, hold Shift and point at the part of the page you want to query to get its XPath

====================================================================
Ch7 Selenium

Install selenium: python -m pip install -U selenium

Also install ChromeDriver: https://sites.google.com/a/chromium.org/chromedriver/downloads

from selenium import webdriver  # import webdriver

driver = webdriver.Chrome("./chromedriver")

driver.get(url)
tag = driver.find_element_by_tag_name("tag")  # Selenium 3 API; Selenium 4 uses driver.find_element(By.TAG_NAME, "tag")
soup = BeautifulSoup(driver.page_source, "lxml")

find_element_by_id()
find_element_by_name()
find_element_by_xpath()
find_element_by_link_text()
find_element_by_partial_link_text()
find_element_by_tag_name()
find_element_by_class_name()
find_element_by_css_selector()

Running a Google search
from selenium.webdriver.common.keys import Keys
keyword = driver.find_element_by_css_selector("#lst-ib")
keyword.send_keys("search term")
keyword.send_keys(Keys.ENTER)

List of Keys constants
https://www.selenium.dev/selenium/docs/api/java/org/openqa/selenium/Keys.html

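Selenium Action Chains build up a sequence of automated page operations
click()  # click an element
click_and_hold()  # hold down the left mouse button
context_click()  # right-click
double_click()  # double-click an element
move_to_element()  # move to the middle of an element
key_up()  # release a key
key_down()  # press a key
perform()  # perform all stored actions
send_keys()  # send keys to the current element
release()  # release the mouse button

from selenium.webdriver.common.action_chains import ActionChains
act = ActionChains(driver)
act.move_to_element(ele)
act.click(ele)
act.perform()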

===============================================================
Ch8

In Anaconda3 > Anaconda Prompt, run conda install -c conda-forge scrapy

===============================================================
Ch9

Choosing a scraping tool
Scraping a few pages => BeautifulSoup
Scraping dynamic pages => Selenium
Large amounts of data from an entire website => the Scrapy framework

9-1 Generating a series of URLs with urljoin and .format
For example:
from urllib.parse import urljoin
catalog = ["2317", "3018", "2308"]

for i in catalog:
    url = urljoin("https://tw.stock.yahoo.com/q/", "q?s={0}".format(i))
    print(url)

===============================================================
Ch10
SQL statements
SELECT * FROM
WHERE condition LIKE: LIKE means "contains", so %a% matches any string that contains a

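import pymysql
db = pymysql.connect(host="localhost", user="root", password="", database="mybooks", charset="utf8")
cursor = db.cursor()  # get a cursor
cursor.execute("SELECT * FROM books")  # run the query

sql = """INSERT abc (......) VALUES (......)"""  # insert
try:
    cursor.execute(sql)  # execute
    db.commit()  # commit the change
except Exception:
    db.rollback()  # undo the change on error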
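
To try the SELECT ... LIKE and INSERT/commit pattern without a MySQL server, the sketch below swaps in Python's built-in sqlite3 module as a stand-in. The table layout and sample rows are made up for illustration. One difference worth noting: sqlite3 uses ? as the parameter placeholder, whereas pymysql uses %s.

```python
import sqlite3

db = sqlite3.connect(":memory:")   # in-memory stand-in for the MySQL database
cursor = db.cursor()
cursor.execute("CREATE TABLE books (id TEXT, title TEXT)")

# Made-up sample rows
rows = [
    ("P001", "Python basics"),
    ("J001", "Java basics"),
    ("P002", "Learning pandas"),
]
try:
    cursor.executemany("INSERT INTO books (id, title) VALUES (?, ?)", rows)
    db.commit()                    # make the inserts permanent
except sqlite3.Error:
    db.rollback()                  # undo the inserts on error
    raise

# %Py% matches any title that contains "Py"
cursor.execute("SELECT title FROM books WHERE title LIKE ?", ("%Py%",))
matches = [row[0] for row in cursor.fetchall()]
db.close()

print(matches)
```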

====================================================================
Supplementary notes from the O'Reilly book
Ch1
from urllib.request import urlopen
html = urlopen("http://xxx/xxx.html")
print(html.read())
python2.x => urllib2
python3.x => urllib

Ch2
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://xxx.xxx/xxx.html")
bsObj = BeautifulSoup(html, "html.parser")
Using the findAll function
nameList = bsObj.findAll("span", {"class":"green"})
for name in nameList:
    print(name.get_text())

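.get_text() strips all tags from the document and returns a string that contains only the text
.get_text() is usually the last step, called only when printing, storing, or processing the final data
Until that final step, keep the tag information for as long as possible
class is a Python reserved word, so writing it directly as a keyword argument raises an error
Use the bs4 workaround class_, or put class in quotes => .findAll("span", {"class":"green"})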

child
for child in bsObj.find("table", {"id":"giftList"}).children:
    print(child)

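RE: regular expressions
aa*  the letter a, at least once
bbbbb  five b's in a row
(cc)*  a pair of c's, any number of times
@  an @ must appear, exactly once
[A-Za-z]+  letters only, at least one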
RegexPal : http://regexpal.com/

bs + regular expression example:
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://xxx.xxx/xxx.html")
bsObj = BeautifulSoup(html, "html.parser")
images = bsObj.findAll("img", {"src": re.compile(r"\.\./img/gifts/img.*\.jpg")})
for image in images:
    print(image["src"])

ch3
Python's default recursion limit is 1000, so a recursive crawler will crash on a large link network such as Wikipedia
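
One common way around the recursion limit is to replace recursion with an explicit queue. The sketch below is not the book's code: it walks a made-up in-memory link graph (a dict standing in for real pages) so that it runs without network access, and the function name crawl is an assumption.

```python
import sys
from collections import deque

limit = sys.getrecursionlimit()   # the interpreter's current recursion limit

# Made-up link graph standing in for real pages: page -> links found on it
links = {
    "/wiki/A": ["/wiki/B", "/wiki/C"],
    "/wiki/B": ["/wiki/C", "/wiki/D"],
    "/wiki/C": ["/wiki/A"],
    "/wiki/D": [],
}


def crawl(start):
    """Visit every reachable page once, using a queue instead of recursion."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for link in links.get(page, []):
            if link not in seen:      # skip pages that are already scheduled
                seen.add(link)
                queue.append(link)
    return order


visited = crawl("/wiki/A")
print(visited)
```

Because the pending pages live in a deque rather than on the call stack, the depth of the link network no longer matters.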

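Redirects come in two kinds:
server-side redirects and client-side redirects
Server-side redirects => handled automatically by Python 3.x's urllib
Client-side redirects => done with JavaScript or HTML, covered in ch10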

Scrapy's five logging levels (most to least severe)
CRITICAL
ERROR
WARNING
INFO
DEBUG
e.g. set LOG_LEVEL = 'ERROR'
Save the log to a specific file:
scrapy crawl article -s LOG_FILE=wiki.log

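ch4 Using APIs
Methods
GET => asks the web server to hand you some data
POST => the action of submitting a form; every login sends the user name and password with POST, as if telling the server "please store this data in the database"
PUT => used to update an object or information
DELETE => sends the server a request to delete data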

ch6 Reading documents
Reading docx => Office Word and similar tools are based on the Office Open XML standard
from zipfile import ZipFile
from urllib.request import urlopen
from io import BytesIO
from bs4 import BeautifulSoup

wordFile = urlopen("http://xxx.com/pages/xxx.docx").read()
wordFile = BytesIO(wordFile)
document = ZipFile(wordFile)
xml_content = document.read('word/document.xml')
print(xml_content.decode('utf-8'))

wordObj = BeautifulSoup(xml_content.decode('utf-8'))
textStrings = wordObj.findAll("w:t")
for textElem in textStrings:
    print(textElem.text)
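
The same unzip-then-parse idea can be tested offline. The sketch below builds a tiny stand-in for a .docx file in memory (a zip archive that holds word/document.xml) and extracts the w:t text runs. It uses the standard library's xml.etree.ElementTree instead of BeautifulSoup, and the document content is made up for illustration.

```python
import xml.etree.ElementTree as ET
from io import BytesIO
from zipfile import ZipFile

# WordprocessingML namespace bound to the w: prefix in document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Made-up minimal document body with two text runs
document_xml = (
    '<w:document xmlns:w="%s"><w:body>'
    "<w:p><w:r><w:t>Hello</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>World</w:t></w:r></w:p>"
    "</w:body></w:document>" % W_NS
)

# Write the stand-in archive into memory
buffer = BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", document_xml)

# Read it back the same way the notes read a downloaded file
buffer.seek(0)
document = ZipFile(buffer)
xml_content = document.read("word/document.xml")

# ElementTree addresses namespaced tags as {namespace}localname
root = ET.fromstring(xml_content)
texts = [node.text for node in root.iter("{%s}t" % W_NS)]

print(texts)
```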

Ch11 Image processing and text recognition p.175
Use import subprocess
to call Tesseract
for the recognition
p = subprocess.Popen(["tesseract", "captcha.jpg", "captcha"], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
p.wait()
f = open("captcha.txt", "r")

ch13 Testing your website with scrapers
driver.get_screenshot_as_file("tmp/pythonscraping.png")


Thursday, March 12, 2020
