IoT篳路藍縷: March 2020

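Ch1 Understanding Artificial Intelligence

Handwriting recognition
Speech recognition
Computer vision
Expert systems
Natural language processing
Computer games
Intelligent robots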

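Machine learning commonly handles classification problems & regression problems
Classification: spam detection
Regression: sales forecasting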

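In practice, machine learning solves five kinds of problems:
1. Classification: binary classification, multiclass classification
2. Anomaly detection: spotting abnormal situations
3. Predictive analytics: regression
4. Clustering: finding similarities and analyzing them
5. Decision support: deciding the next move, e.g. AlphaGo.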

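Types of machine learning
1. Supervised learning: requires the correct answers (labels)
Classification: yes/no questions and grading
Regression: prediction
2. Unsupervised learning: clustering.
3. Semi-supervised learning: uses a small amount of labeled data to improve clustering accuracy.
4. Reinforcement learning: there is no explicit answer; the strategy must change based on the information received.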

===============================================================
Ch2 Setting Up a TensorFlow and Keras Development Environment

1.python -m pip install -U tensorflow
2.python -m pip install -U keras
3.python -m pip install --upgrade numpy

IDE choices
Spyder / Jupyter Notebook
Runtime environment: NVIDIA GTX 1060 6GB graphics card / Google Colaboratory cloud service

Google Colaboratory => in Google Drive, choose New > More > Connect more apps to add the application
Enable the GPU: Runtime -> Change runtime type

Select GPU

Mount external Google Drive files in Colaboratory
from google.colab import drive
drive.mount("/content/drive")

Granting authorization allows access to other files under Google Drive, such as .csv files
Next, set up a keras virtual environment from the Anaconda Prompt
conda create --name keras anaconda

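conda env remove --name keras  # remove the environment

conda env list  # list the current Anaconda environments
activate keras  # activate the keras environment

Finally, the conda list command shows the installed packages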
==================================================================
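Ch3 Theoretical Foundations of Deep Learning

Artificial intelligence contains machine learning, which contains deep learning => deep learning is therefore a subset of machine learning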

http://playground.tensorflow.org/

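A graph consists of vertices and edges; depending on whether the edges have a direction, it is either an undirected graph or a directed graph

If the edges carry values, it is a weighted graph. A directed graph that can form a cycle is a directed cyclic graph; one without cycles is a directed acyclic graph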

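Differentiation and partial differentiation
Input (dendrites) => cell body (neuron) => output (axon)

An artificial neuron multiplies each input by its weight, sums the results, and compares the sum with a threshold to decide whether to output 1 or 0
Sum at or above the threshold => output 1; otherwise output 0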

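The threshold rule above can be sketched in a few lines. This is only an illustration: the function name neuron_output and the AND-style weights and threshold are assumptions chosen for the example, not values from the book.

```python
def neuron_output(inputs, weights, threshold):
    """Return 1 if the weighted sum of the inputs reaches the threshold, else 0."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1 if weighted_sum >= threshold else 0


# Hypothetical weights and threshold that make the neuron behave like logical AND.
and_weights = [1.0, 1.0]
and_threshold = 2.0

for pair in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(pair, neuron_output(pair, and_weights, and_threshold))
```

Only the input (1, 1) reaches the threshold, so it is the only case that outputs 1.
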
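ANNs (artificial neural networks) is an umbrella term that includes the multilayer perceptron (MLP), the convolutional neural network (CNN), and the recurrent neural network (RNN)

Multilayer Perceptron (MLP): the network at http://playground.tensorflow.org/ is a typical MLP, also called a Feedforward Neural Network (FNN); it is the traditional one-directional, multi-layer neural network
Input data -> input layer -> hidden layer -> ... -> hidden layer -> output layer -> output data
An MLP is a neural network; when it has more than one hidden layer (a network of four or more layers), it is called deep learning

Convolutional Neural Network (CNN): a neural network modeled on the visual processing area of the human brain and specialized for image recognition, such as image classification, face recognition, and handwriting recognition. It is also a kind of feedforward neural network, but its layers are different
Input data -> input layer -> convolutional layer -> pooling layer -> ... convolutional layer -> pooling layer -> fully connected layer -> output layer -> output data

Recurrent Neural Network (RNN): a neural network with short-term memory for sequence data such as audio, language, and video; in practice it has been replaced by the improved LSTM and GRU
Input data -> input layer -> hidden layer -> recurrent hidden layer -> output layer -> output data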

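Number of tensor dimensions = number of axes = rank
0D: a scalar value
1D: a one-dimensional array
2D: a two-dimensional array
3D: a three-dimensional array, or a one-dimensional array of matrices with a time-step axis added
4D: a four-dimensional array; real feature data such as images and image sets use 4D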

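import numpy as np

a = np.array([[1, 2], [3, 4]])
s = np.array([[5, 6], [7, 8]])
print(a.ndim)   # number of axes
print(a.shape)  # shape
# + - * / can be used directly and work element by element
b = a.dot(s)    # dot product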
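
To make the rank list above concrete, here is a small sketch that builds one array per rank and inspects ndim and shape. The shapes (for example a batch of 8 grayscale 28x28 images) are made-up values for illustration only.

```python
import numpy as np

scalar = np.array(5)                      # 0D: a single value
vector = np.array([1, 2, 3])              # 1D: one axis
matrix = np.array([[1, 2], [3, 4]])       # 2D: rows and columns
series = np.zeros((10, 3, 2))             # 3D: e.g. 10 time steps of 3x2 matrices
images = np.zeros((8, 28, 28, 1))         # 4D: hypothetical batch of 8 grayscale images

for name, tensor in [("scalar", scalar), ("vector", vector),
                     ("matrix", matrix), ("series", series), ("images", images)]:
    print(name, tensor.ndim, tensor.shape)

# Dot product of a 2D array with a 1D array.
product = matrix.dot(np.array([1, 2]))
print(product)
```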

============================================================
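Ch4 Multilayer Perceptron

Understand the difference between linearly separable and linearly non-separable problems

A neuron is also called a perceptron
Regression problems use one output neuron
Binary classification problems use one or two output neurons
Multiclass classification problems use one output neuron per class => digits 0~9 need 10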

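Suggested numbers of layers and neurons
The number of hidden-layer neurons lies between the input-layer size and the output-layer size
The number of neurons is 2/3 of the input-layer size plus the output-layer size
The number of hidden-layer neurons is less than twice the number of input-layer neurons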

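Supervised learning has true target values (the correct answers), also called labels; the difference between prediction and label (the loss score) is used to adjust the weights, so the model becomes more and more accurate

The neural network training loop consists of forward propagation, loss evaluation, and backpropagation
Forward propagation: compute the predicted values
Loss evaluation: compare against the true targets to compute the loss
Backpropagation: compute each layer's share of the error and update the weights with gradient descent. The optimizer uses backpropagation to compute the gradient of the loss that each layer's weights are responsible for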
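
The three steps of the training loop can be sketched on the smallest possible model, a single linear neuron trained with plain gradient descent. The toy data (y = 2x + 1), the learning rate, and the epoch count are assumptions picked for the example; a real Keras model hides these steps behind fit().

```python
import numpy as np

# Toy training data: the targets follow y = 2x + 1 (an assumed example).
x = np.array([0.0, 1.0, 2.0, 3.0])
y_true = 2.0 * x + 1.0

w, b = 0.0, 0.0          # weight and bias to be learned
learning_rate = 0.1
losses = []

for epoch in range(500):
    # Forward propagation: compute the predictions.
    y_pred = w * x + b
    # Loss evaluation: mean squared error against the true targets.
    error = y_pred - y_true
    losses.append(float(np.mean(error ** 2)))
    # Backpropagation: gradient of the loss with respect to w and b.
    grad_w = 2.0 * np.mean(error * x)
    grad_b = 2.0 * np.mean(error)
    # Gradient descent: move the parameters against the gradient.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b, losses[0], losses[-1])
```

The loss shrinks from one epoch to the next while w and b move toward the values used to generate the data.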

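The derivative of sigmoid is at most 1/4; during backpropagation many such factors of at most 1/4 are multiplied together, so the gradient approaches 0 and vanishes

(the gradient never reaches the first layers)

To avoid the vanishing gradient problem, use ReLU: the output is 0 when the input is less than 0, and a linear function of the input when it is greater than 0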
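
A quick numerical check of the 1/4 argument, as a sketch: the layer count of 10 is an arbitrary assumption, and the product shown is only the best-case factor contributed by the activations, ignoring the weights.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sigmoid_grad(z):
    """Derivative of sigmoid: s * (1 - s), largest at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)


def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return np.where(z > 0, 1.0, 0.0)


max_sigmoid_grad = float(sigmoid_grad(0.0))      # the peak value, 1/4
layers = 10                                      # assumed depth for illustration
sigmoid_factor = max_sigmoid_grad ** layers      # best case after 10 sigmoid layers
relu_factor = float(relu_grad(np.array(3.0))) ** layers  # stays 1 for active units

print(max_sigmoid_grad, sigmoid_factor, relu_factor)
```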

Regression problems use mean squared error
Classification problems use cross-entropy
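
Both loss functions are short enough to write out by hand. This is a minimal NumPy sketch with assumed function names, not the Keras implementations; the eps clipping is just a guard against log(0).

```python
import numpy as np


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences (used for regression)."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy between one-hot labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0)
    return float(-np.mean(np.sum(y_true * np.log(y_pred), axis=1)))


mse = mean_squared_error([1.0, 2.0], [1.0, 4.0])
cce_unsure = categorical_cross_entropy([[1, 0]], [[0.5, 0.5]])
cce_confident = categorical_cross_entropy([[1, 0]], [[0.9, 0.1]])
print(mse, cce_unsure, cce_confident)
```

A confident correct prediction gives a smaller cross-entropy than an unsure one, which is what pushes the network toward the right class.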

Reference book: Python網路爬蟲與資料視覺化應用實務 (Flag Publishing)
Requests

import requests

requests.get parameters
url = "http://www.google.com"
r = requests.get(url, timeout=0.3)  # set the timeout in seconds
r = requests.get(url, cookies={"over18": "1"})  # send an over18 cookie to get past the PTT age check
nextPg = response.xpath(xpath)
timeout = 0.3
'maxResult' : 5,
headers=url_headers

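Exception handling:
RequestException: request error
HTTPError: invalid HTTP response
ConnectionError: connection error
Timeout: the request timed out
TooManyRedirects: exceeded the maximum number of redirects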
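
One way these exceptions might be handled together, as a sketch: the helper name fetch and the 3-second default timeout are assumptions. The more specific exceptions are caught first because they all inherit from RequestException.

```python
import requests


def fetch(url, timeout=3.0):
    """Return the response body as text, or None if the request failed."""
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()  # turn 4xx/5xx status codes into HTTPError
        return r.text
    except requests.exceptions.Timeout:
        print("request timed out")
    except requests.exceptions.TooManyRedirects:
        print("too many redirects")
    except requests.exceptions.HTTPError as err:
        print("invalid HTTP response:", err)
    except requests.exceptions.ConnectionError:
        print("connection error")
    except requests.exceptions.RequestException as err:
        # Base class: catches anything not handled above, e.g. a malformed URL.
        print("request error:", err)
    return None


# A URL without a scheme fails before any network traffic happens.
result = fetch("not-a-url")
print(result)
```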

Using bs4 (BeautifulSoup) together with requests

from bs4 import BeautifulSoup
import requests

r = requests.get("http://www.google.com")
soup = BeautifulSoup(r.text, "lxml")
print(soup)

To write soup to a file, format it first by calling
prettify()
Example:
with open("xxx.txt", "w", encoding="utf8") as fp:
    fp.write(soup.prettify())
    print("writing")

============================================================
find()

find() returns the first match for a condition
find_all() searches the entire HTML page

find_all() with additional constraints =>
soup.find_all("tag", class_="class_name", limit=limit_num)
(arguments: tag, id, limit=)
============================================================
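Ch4
select()

BeautifulSoup provides select and select_one
select() returns a list of all results that match the condition
select_one() returns only one result
The CSS selector support uses nth-of-type, so an "nth-" pseudo-class copied from elsewhere must be rewritten => nth-of-type
e.g. tag.select("div:nth-of-type(3)")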
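
The difference between select() and select_one() shows up clearly on a tiny document. The HTML below is a made-up example.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for this illustration.
html_doc = "<div><p>one</p><p>two</p><p>three</p></div>"
soup = BeautifulSoup(html_doc, "html.parser")

paragraphs = soup.select("div > p")              # list of every match
third = soup.select_one("p:nth-of-type(3)")      # a single Tag
missing = soup.select_one("p:nth-of-type(4)")    # None when nothing matches

print([p.get_text() for p in paragraphs])
print(third.get_text())
print(missing)
```

Because select_one() returns None when nothing matches, it is worth checking the result before reading attributes from it.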

============================================================
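Ch5
Traversing ancestors, siblings, and descendants

tag.contents and tag.children both return the child tags
and for child in tag.descendants walks every descendant tag below
for tag in tag_ui.children  # loop over child tags
for tag in tag_ui.parents  # loop over ancestor tags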

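next_sibling: step to the next sibling
previous_sibling: step to the previous sibling

find_previous_sibling()  # find the previous sibling (this function automatically skips NavigableString objects)
find_next_sibling()  # find the next sibling (this function automatically skips NavigableString objects)

Walking to the next and previous elements
next_element  # the next element
previous_element  # the previous element
Usage
next_tag = tag.next_element
print(type(next_tag), next_tag.name)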
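
Why the find_*_sibling() functions matter becomes visible when the HTML contains line breaks between tags. A sketch on invented markup:

```python
from bs4 import BeautifulSoup

# Made-up HTML; the "\n" between items becomes NavigableString nodes.
html_doc = "<ul><li>a</li>\n<li>b</li>\n<li>c</li></ul>"
soup = BeautifulSoup(html_doc, "html.parser")

first = soup.find("li")
raw_next = first.next_sibling              # the "\n" text node, not a tag
tag_next = first.find_next_sibling("li")   # skips the text node
inner = first.next_element                 # the text inside the first <li>
ancestors = [p.name for p in first.parents]

print(repr(raw_next), tag_next.get_text(), repr(inner), ancestors)
```

next_sibling lands on the whitespace string, while find_next_sibling() goes straight to the next li tag.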

===========================================================
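Modification
Changes only modify the Python object tree, not the source of the web page
tag.string = "str_value"  # change the text
del tag["tag_element"]  # delete an attribute
new_tag = soup.new_tag("tag_name")  # add, step 1: create the tag
tag.append(new_tag)  # add, step 2: append it
insert_before  # insert before xxx
insert_after  # insert after xxx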

Ch5-4 Reading and Writing CSV & JSON
Writing CSV
import csv
with open(csv_file, 'w+', newline="") as fp:
    writer = csv.writer(fp)
    writer.writerow(["Data1","Data2","Data3"])  # write the header row Data1~Data3

JSON
import json
j_str = json.dumps(json_data)  # dict json_data -> JSON string j_str
json_data2 = json.loads(j_str)  # JSON string j_str -> dict json_data2

j_file = "jfile.json"
with open(j_file, 'w') as fp:  # open j_file for writing
    json.dump(data, fp)  # write data to fp
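
A round trip makes the read and write directions easier to keep apart. This sketch writes to a temporary directory so it leaves no files behind; the file names and sample data are assumptions.

```python
import csv
import json
import os
import tempfile

rows = [["Data1", "Data2", "Data3"], ["1", "2", "3"]]
data = {"name": "book", "price": 450}

with tempfile.TemporaryDirectory() as tmp:
    # CSV: write the rows, then read them back.
    csv_path = os.path.join(tmp, "demo.csv")
    with open(csv_path, "w", newline="", encoding="utf8") as fp:
        csv.writer(fp).writerows(rows)
    with open(csv_path, newline="", encoding="utf8") as fp:
        rows_back = list(csv.reader(fp))

    # JSON file: dump() writes to a file object, load() reads from one.
    json_path = os.path.join(tmp, "demo.json")
    with open(json_path, "w", encoding="utf8") as fp:
        json.dump(data, fp)
    with open(json_path, encoding="utf8") as fp:
        data_back = json.load(fp)

# JSON string: dumps() and loads() work on strings instead of files.
j_str = json.dumps(data)
print(rows_back, data_back, j_str)
```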

Ch5-5 Downloading images
response = requests.get(url, stream=True)
with open(path, 'wb') as fp:
    for chunk in response:
        fp.write(chunk)

Or open the download with urllib.request.urlopen(url)  # chosen for efficiency

===========================================================
Ch6 XPath 與 lxml 應用
from lxml import html

After finding the tag that holds the resource URL, click the three dots and choose Copy... XPath:

tag_img = tree.xpath("/html/body/section[2]/table/tbody/tr[21]/td[1]/a/img")[0]
print(tag_img.attrib["src"])

=>https://www.flag.com.tw/assets/img/bookpic/F4473.jpg

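6-4 Basic XPath syntax: axis::node-test[predicate]
* : wildcard that matches any element node (@* matches any attribute node)
Axis: where to look relative to the current node (see the axis definitions for details)
Node test: which nodes on this axis match
Predicates: further filter conditions
Using the book's example /child::library/child::book[2]/child::author
/child::library: the child node of the root named library
/child::book[2]: the second child node named book
/child::author: the child node named author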
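
The book's path can be tried on a small inline document. The XML below is invented for the example; the sketch also shows the abbreviated form, in which child:: is the default axis.

```python
from lxml import etree

# Made-up XML that follows the library/book/author structure of the example.
xml_doc = b"""<library>
  <book><title>A</title><author>Alice</author></book>
  <book><title>B</title><author>Bob</author></book>
</library>"""
root = etree.fromstring(xml_doc)

long_form = root.xpath("/child::library/child::book[2]/child::author")
short_form = root.xpath("/library/book[2]/author")   # same query, abbreviated

print([e.text for e in long_form])
print([e.text for e in short_form])
```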

XPath Helper: hold down Shift and select the part of the page you want to query

====================================================================
Ch7 Selenium

Install selenium: python -m pip install -U selenium

Also install ChromeDriver: https://sites.google.com/a/chromium.org/chromedriver/downloads

from selenium import webdriver  # import webdriver

driver = webdriver.Chrome("./chromedriver")

driver.get(url)
tag = driver.find_element_by_tag_name("tag")
soup = BeautifulSoup(driver.page_source, "lxml")

Note: the find_element_by_* methods are the Selenium 3 API; Selenium 4 replaces them with find_element(By.TAG_NAME, "tag") and similar By locators.

find_element_by_id()
find_element_by_name()
find_element_by_xpath()
find_element_by_link_text()
find_element_by_partial_link_text()
find_element_by_tag_name()
find_element_by_class_name()
find_element_by_css_selector()

Run a Google search
from selenium.webdriver.common.keys import Keys
keyword = driver.find_element_by_css_selector("#lst-ib")
keyword.send_keys("search term")
keyword.send_keys(Keys.ENTER)

List of Keys key constants
https://www.selenium.dev/selenium/docs/api/java/org/openqa/selenium/Keys.html

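Selenium Action Chains build a sequence of automated page interactions
click()  # click an element
click_and_hold()  # press and hold the left mouse button
context_click()  # right-click
double_click()  # double-click an element
move_to_element()  # move to the middle of an element
key_up()  # release a key
key_down()  # press a key
perform()  # perform all stored actions
send_keys()  # send keys to the current element
release()  # release the mouse button

from selenium.webdriver.common.action_chains import ActionChains
act = ActionChains(driver)
act.move_to_element(ele)
act.click(ele)
act.perform()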

===============================================================
Ch8

In Anaconda3/Anaconda Prompt, run conda install -c conda-forge scrapy

===============================================================
Ch9

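Choosing a scraping tool
Scraping a few pages => BeautifulSoup
Scraping dynamic pages => Selenium
Large amounts of data from an entire website => Scrapy framework

9-1 Use urljoin together with .format to generate a series of URLs
For example:
from urllib.parse import urljoin
catalog = ["2317", "3018", "2308"]

for i in catalog:
    url = urljoin("https://tw.stock.yahoo.com", "/q/q?s={0}".format(i))
    print(url)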

===============================================================
Ch10
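SQL statements
SELECT * FROM
WHERE condition LIKE: LIKE means "contains"; %a% matches strings that contain a

import pymysql
db = pymysql.connect(host="localhost", user="root", password="", database="mybooks", charset="utf8")
cursor = db.cursor()  # create a cursor
cursor.execute("SELECT * FROM books")  # run the query

sql = """INSERT INTO abc (......) VALUES (......)"""  # insert
try:
    cursor.execute(sql)  # execute
    db.commit()  # commit the change
except Exception:
    db.rollback()  # undo the change if the insert fails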

====================================================================
O'Reilly supplement
Ch1
from urllib.request import urlopen
html = urlopen("http://xxx/xxx.html")
print(html.read())
python2.x => urllib2
python3.x => urllib

Ch2
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://xxx.xxx/xxx.html")
bsObj = BeautifulSoup(html)
Using the findAll function
nameList = bsObj.findAll("span", {"class":"green"})
for name in nameList:
    print(name.get_text())

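.get_text() strips all tags from the document and returns a string that contains only the text
.get_text() is usually the last step, done only when printing, storing, or processing the final data
Until that final step, keep the tag information for as long as possible
class is a Python reserved word, so writing it directly as a keyword argument raises an error
Use the bs workaround class_, or wrap class in quotes => .findAll("span", {"class":"green"})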

Children
for child in bsObj.find("table", {"id":"giftList"}).children:
    print(child)

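RE: regular expressions
aa* : the letter a, followed by zero or more additional a's
bbbbb : five consecutive b's
(cc)* : a pair of c's may appear any number of times
@ : an @ must appear, exactly once
[A-Za-z]+ : letters only, with at least one letter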
RegexPal : http://regexpal.com/

bs + regular expression example:
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://xxx.xxx/xxx.html")
bsObj = BeautifulSoup(html)
imges = bsObj.findAll("img", {"src" : re.compile(r"\.\./img/gifts/img.*\.jpg")})
for imge in imges:
    print(imge["src"])

ch3
Python's default recursion depth limit is 1000, so a recursive crawler crashes on a large link network such as Wikipedia

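Redirects are divided into
server-side redirects and client-side redirects
Server-side redirects => handled automatically by urllib in Python 3.x
Client-side redirects => covered in ch10; implemented with JavaScript or HTML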
Scrapy's five logging levels (most to least severe)
CRITICAL
ERROR
WARNING
INFO
DEBUG
e.g. set LOG_LEVEL = 'ERROR'
Save the log to a specific file
scrapy crawl article -s LOG_FILE=wiki.log

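ch4 Using APIs
Methods
GET => asks the web server to hand over some data
POST => submits form data; every login sends the user name and password with POST, like telling the server to store this data in the database
PUT => updates an object or a piece of information
DELETE => sends the server a request to delete data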

ch6 Reading documents
Reading docx => Office Word and similar tools are based on the Office Open XML standard
from zipfile import ZipFile
from urllib.request import urlopen
from io import BytesIO
from bs4 import BeautifulSoup

wordFile = urlopen("http://xxx.com/pages/xxx.docx").read()
wordFile = BytesIO(wordFile)
document = ZipFile(wordFile)
xml_content = document.read('word/document.xml')
print(xml_content.decode('utf-8'))

wordObj = BeautifulSoup(xml_content.decode('utf-8'))
textStrings = wordObj.findAll("w:t")
for textElem in textStrings:
    print(textElem.text)
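
The same idea can be tried without downloading anything, since a .docx file is a zip archive with the text stored in word/document.xml. This sketch builds a stand-in archive in memory and, as a swapped-in alternative to BeautifulSoup, reads the w:t elements with the standard library's ElementTree. The document content is invented, and a real Word file contains many more parts.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML for the w: prefix.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Made-up minimal document.xml with two paragraphs.
document_xml = (
    '<w:document xmlns:w="%s"><w:body>'
    "<w:p><w:r><w:t>Hello</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>World</w:t></w:r></w:p>"
    "</w:body></w:document>" % W_NS
)

# Build the stand-in .docx (just a zip archive) in memory.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("word/document.xml", document_xml)

# Read it back the same way the notes do: open the zip, read the XML part.
buffer.seek(0)
with zipfile.ZipFile(buffer) as document:
    xml_content = document.read("word/document.xml")

# Collect the text of every w:t element.
root = ET.fromstring(xml_content)
texts = [node.text for node in root.iter("{%s}t" % W_NS)]
print(texts)
```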

Ch11 Image Processing and Text Recognition p.175
Call Tesseract through subprocess to perform the recognition
import subprocess
p = subprocess.Popen(["tesseract", "captcha.jpg", "captcha"], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
p.wait()
f = open("captcha.txt", "r")

ch13 Testing your website with scrapers
driver.get_screenshot_as_file("tmp/pythonscraping.png")

IoT篳路藍縷

[Pinned posts]

Sunday, March 15, 2020

[Python] Key Notes on Machine Learning and Deep Learning

Thursday, March 12, 2020

[Python] Study Notes on Web Scraping
