Python Web Scraping

  1. requests module
  2. Data parsing
  3. Simulated login
  4. Asynchronous crawling
  5. selenium
  6. scrapy framework

A web crawler simulates a browser to fetch data from the internet.

A general-purpose crawler is a core component of a search engine's crawling system and fetches entire pages. A focused crawler builds on a general-purpose crawler and extracts only specific parts of a page. An incremental crawler detects updates on a site and fetches only the newly added data.

Anti-crawling mechanisms are policies or technical measures that websites use to prevent crawlers from scraping their data; anti-anti-crawling strategies are used to defeat those mechanisms and obtain the data anyway.

The robots.txt protocol specifies which parts of a site may be crawled.
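As a quick illustration (an added sketch, not part of the original notes), Python's standard library can parse a site's robots.txt and answer whether a given URL may be fetched:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://www.baidu.com/robots.txt')
rp.read()
# can_fetch() reports whether the given User-Agent is allowed to crawl the URL
print(rp.can_fetch('*', 'https://www.baidu.com/s?wd=python'))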

HTTP is a protocol for exchanging data between a server and a client.

Request headers:

  • User-Agent: identifies the client ("carrier") making the request
  • Connection: whether to close the connection after the request completes, e.g. keep-alive or close

Response headers:

  • Content-Type: the type of data the server sends back to the client
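As a small illustration (an added sketch using the requests module introduced below), these headers can be set on a request and read from the response:

import requests

headers = {'User-Agent': 'Mozilla/5.0', 'Connection': 'close'}
r = requests.get('https://www.baidu.com/', headers=headers)
# the server reports the type of the returned data in its response headers
print(r.headers.get('Content-Type'))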

HTTPS is HTTP with encryption, i.e. the secure hypertext transfer protocol.

HTTPS uses certificate-based key encryption.

requests module

requests is a third-party Python module for sending network requests; it is used to simulate a browser issuing requests.

import requests

# specify the URL
url = 'https://dict.baidu.com/'
# send the request
r = requests.get(url=url)
# get the response data
page_text = r.text
# persist the data (example filename)
with open('baidu_dict.html', 'w', encoding='utf-8') as fp:
    fp.write(page_text)

Specifying parameters:

url = 'https://www.sogou.com/web'
params = {'query':'特朗普'}
r = requests.get(url=url, params=params)

UA spoofing: websites may use the request's User-Agent to decide whether a request comes from a crawler, so the crawler should disguise itself as a browser.

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'}
r = requests.get(url=url, params=params, headers=headers)

Persisting data:

import json

url = 'https://movie.douban.com/j/chart/top_list'  # Douban movie rankings
params = {'type': '17',            # sci-fi category
          'interval_id': '100:90',
          'action': '',
          'start': 0,              # starting index
          'limit': 20}             # number of entries
response = requests.get(url=url, params=params, headers=headers)
list_data = response.json()
with open('douban.json', 'w', encoding='utf-8') as fp:
    json.dump(list_data, fp=fp, ensure_ascii=False)

Saving an image:

url = 'https://cdn.cnbj1.fds.api.mi-img.com/mi-mall/537e0430d5c1b77f0d5123d6bcfc25db.jpg?w=2452&h=920'
img_data = requests.get(url=url, headers=headers).content
with open('xiaomi.jpg', 'wb') as fp:
    fp.write(img_data)

Sometimes the text is decoded with the wrong encoding:

page_text = requests.get(url=url, headers=headers).content.decode('gbk')

page = requests.get(url=url, headers=headers)
page.encoding = 'gbk'
page_text = page.text

text = text.encode('iso-8859-1').decode('utf-8')
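Another option (an added alternative, not from the original notes) is to let requests guess the encoding from the response body itself:

r = requests.get(url=url, headers=headers)
r.encoding = r.apparent_encoding  # encoding detected from the content
page_text = r.text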

POST requests:

post_url = 'https://fanyi.baidu.com/sug'
data = {'kw': 'chaos'}
response = requests.post(url=post_url, data=data, headers=headers)
dic = response.json()  # Baidu Translate returns JSON
# {'errno': 0, 'data': [{'k': 'chaos', 'v': 'n. 混乱; 杂乱; 紊乱;'}, {'k': 'chaos theory', 'v': 'n. 混沌理论;'}]}

# 'http://scxk.nmpa.gov.cn:81/xk/' — NMPA cosmetics production licence information platform
post_url = 'http://scxk.nmpa.gov.cn:81/xk/itownet/portalAction.do?method=getXkzsList'
data = {'on': 'true',
        'page': '1',
        'pageSize': '15',
        'productName': '',
        'conditionType': '1',
        'applyname': '',
        'applysn': ''}
json_ids = requests.post(url=post_url, data=data, headers=headers).json()
# collect the company IDs; the detail data can only be fetched by ID
id_list = []
for dic in json_ids['list']:
    id_list.append(dic['ID'])

# http://scxk.nmpa.gov.cn:81/xk/itownet/portal/dzpz.jsp?id=ff83aff95c5541cdab5ca6e847514f88 is one
# company's detail page, but its content is also loaded dynamically

post_url = 'http://scxk.nmpa.gov.cn:81/xk/itownet/portalAction.do?method=getXkzsById'
all_detail_list = []
for id in id_list:
    data = {'id': id}
    detail_json = requests.post(
        url=post_url, data=data, headers=headers).json()
    all_detail_list.append(detail_json)

Data parsing

Regular expression matching. The goal here is to extract the image URLs from the block below.

<div class="swiper-slide ">
<a target="_blank" href="https://www.mi.com/a/h/18087.html" data-log_code="31pchomepagegallery000001#t=ad&amp;act=webview&amp;page=homepage&amp;page_id=10530&amp;bid=3480927.1&amp;adp=3131&amp;adm=25177">
<img class="swiper-lazy" src="https://cdn.cnbj1.fds.api.mi-img.com/mi-mall/537e0430d5c1b77f0d5123d6bcfc25db.jpg?w=2452&amp;h=920" alt="" key="https://cdn.cnbj1.fds.api.mi-img.com/mi-mall/537e0430d5c1b77f0d5123d6bcfc25db.jpg?w=2452&amp;h=920">
</a>
</div>
<div class="swiper-slide ">
<a target="_blank" href="https://www.mi.com/buy/detail?product_id=10000204" data-log_code="31pchomepagegallery000001#t=ad&amp;act=webview&amp;page=homepage&amp;page_id=10530&amp;bid=3480927.2&amp;adp=3132&amp;adm=25133">
<img class="swiper-lazy" data-src="https://cdn.cnbj1.fds.api.mi-img.com/mi-mall/cf6ba4d372b80e939104cf369f14139a.jpg?w=2452&amp;h=920" alt="" key="https://cdn.cnbj1.fds.api.mi-img.com/mi-mall/cf6ba4d372b80e939104cf369f14139a.jpg?w=2452&amp;h=920">
</a>
</div>

import re

url = 'https://www.mi.com/'
page_text = requests.get(url=url, headers=headers).text
# note: lazily loaded images use data-src instead of src, so only the src images are matched here
pattern = '<img class="swiper-lazy".*?src="(.*?)".*?</div>'
img_src_list = re.findall(pattern, page_text, re.S)

pattern = 'mi-mall/(.*?jpg)'
for src in img_src_list:
    img_data = requests.get(url=src, headers=headers).content
    img_name = re.search(pattern, src).group(1)
    with open(img_name, 'wb') as fp:
        fp.write(img_data)

bs4 (BeautifulSoup) is a parsing approach specific to Python.

Parsing local or downloaded HTML:

from bs4 import BeautifulSoup

# local html file
with open(path, 'r', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, 'lxml')
# html fetched from the web
page_text = requests.get(url=url, headers=headers).text
soup = BeautifulSoup(page_text, 'lxml')

Finding elements:

soup.img                                 # first img tag
soup.find('img')                         # first img tag
soup.find_all('img')                     # all matching tags
soup.find_all('img', class_='lazyload')  # search by attribute; class_ has a trailing underscore to avoid the keyword
soup.select('div.charimg > img')         # CSS selector, returns all matches

Extracting data from elements:

d.text        # all text under the element
d.get_text()  # all text under the element
d.string      # only the text directly contained in the element

from urllib.parse import unquote  # for URL-decoding file names

imgs = soup.select('div.charimg > img')
for img in imgs:
    src = img['src']  # element attribute
    img_data = requests.get(url=src, headers=headers).content
    img_name = unquote(src.split('/')[-1])  # URL-decode the name back to Chinese characters
    with open(img_name, 'wb') as fp:
        fp.write(img_data)

url = 'http://www.biquge.info/24_24159/'
page_text = requests.get(url=url, headers=headers).text
soup = BeautifulSoup(page_text, 'lxml')

titles = soup.select('div.box_con dl a')
# this site is decoded as ISO-8859-1 by requests, so writing the file back out with the same
# encoding round-trips the original bytes unchanged
with open('titles.txt', 'w', encoding='iso-8859-1') as fp:
    for title in titles:
        fp.write(title.string + "\n")

with open('contents.txt', 'w', encoding='iso-8859-1') as fp:
    for title in titles:
        new_url = url + title['href']
        content_text = requests.get(url=new_url, headers=headers).text
        fp.write(title.string)
        pattern = '<div id="content"><!--go-->(.*?)<!--over-->'
        contents = re.search(pattern, content_text, re.S).group(1)
        contents = contents.replace('&nbsp;', ' ')
        contents = contents.replace('<br/>', '\n')
        fp.write(contents)
        fp.write('\n\n\n')
        fp.flush()

XPath is the most commonly used parsing approach, and also convenient and efficient.

from lxml import etree

# local html file
tree = etree.parse(path)

# html fetched from a url
page_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(page_text)

Locating tags:

r = tree.xpath('/html/head/title')           # returns a list of Elements; a leading / starts from the root tag
r = tree.xpath('/html//title')               # a double slash // matches across multiple levels
r = tree.xpath('//title')                    # all title tags
r = tree.xpath('//div[@class="classname"]')  # locate by attribute
r = tree.xpath('//div/p[3]')                 # locate by index, starting from 1
r = tree.xpath('//div/a | //div/p')          # locate two kinds of tags at once

Getting text:

r = tree.xpath('//a/text()')   # direct text only, still returned in a list
r = tree.xpath('//a//text()')  # all text under the tag
r = tree.xpath('//img/@src')   # attribute values

Examples:

url = 'https://www.58.com/ershoufang/'  # 58.com second-hand housing listings
page_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(page_text)
trs = tree.xpath('//div[@id="global"]//tr')
for tr in trs:
    title = tr.xpath('./td[2]/a/text()')[0]  # ./ starts from the current element
    price = tr.xpath('./td[3]//text()')[0]
    print(title)
    print(price + "万")  # price is in units of 10,000 RMB

import os

url = 'http://prts.wiki/w/干员一览'  # Arknights operator list (PRTS wiki)
page_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(page_text)
divs = tree.xpath('//div[@class="smwdata"]')
for div in divs:
    img_url = div.xpath('./@data-icon')[0]
    name = unquote(img_url.split('/')[-1])
    img_data = requests.get(url=img_url, headers=headers).content
    with open(os.path.join('icon', name), 'wb') as fp:
        fp.write(img_data)

url = 'https://pvp.qq.com/web201605/js/herolist.json'  # Honor of Kings hero list
hero_list = requests.get(url=url, headers=headers).json()
for hero in hero_list:
    cname = hero["cname"]
    ename = hero["ename"]
    detail_url = "https://pvp.qq.com/web201605/herodetail/{}.shtml".format(ename)  # hero detail page
    detail_page = requests.get(url=detail_url, headers=headers).content.decode('gbk')
    tree = etree.HTML(detail_page)
    pics = tree.xpath('//div[@class="pic-pf"]/ul/@data-imgname')[0]
    pic_name_list = pics.split('|')
    for i, pic_name in enumerate(pic_name_list):
        pic_url = 'https://game.gtimg.cn/images/yxzj/img201606/skin/hero-info/{}/{}-bigskin-{}.jpg'.format(ename, ename, i + 1)  # skin poster
        pic_name = pic_name.split('&')[0]
        img_data = requests.get(url=pic_url, headers=headers).content
        with open(os.path.join('skin', cname + '_' + pic_name + '.jpg'), 'wb') as fp:
            fp.write(img_data)

Simulated login

When logging in, the login information is POSTed to the server. Captchas are an anti-crawling mechanism; the corresponding counter-strategy is captcha recognition.

HTTP/HTTPS is stateless: the server does not keep track of a client's request state on its own. Cookies are used to let the server remember the client's state.

headers['cookie'] = cookie
page_text = requests.get(url, headers=headers).text
response = requests.post(url=url, data=data, headers=headers)  # carries the login information
print(response.status_code)  # response status code; 200 means success

A session object can acquire and carry cookies automatically.

session = requests.Session()
response = session.post(url=url, data=data, headers=headers)
page_text = session.get(url=url, headers=headers).text

The server may limit how often a single IP can access it; the counter-strategy is to use proxies. A proxy can get around the access limit on your own IP and also hide it.

proxies={'https':'121.230.55.133:9999'}
page_text = requests.get(url, headers=headers, proxies=proxies).text
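To check that the proxy is actually being used, one option (an added sketch; https://httpbin.org/ip simply echoes the caller's IP) is:

r = requests.get('https://httpbin.org/ip', headers=headers, proxies=proxies)
print(r.json())  # should show the proxy's IP instead of your own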

Asynchronous crawling

Thread and process pools:

from multiprocessing import Pool         # process pool
# from multiprocessing.dummy import Pool # same API, but thread-based

pool = Pool(4)
ret = pool.map(func, array)

def func(img_url):
    name = unquote(img_url.split('/')[-1])
    img_data = requests.get(url=img_url, headers=headers).content
    with open(os.path.join('icon', name), 'wb') as fp:
        fp.write(img_data)


if __name__ == "__main__":
    url = 'http://prts.wiki/w/干员一览'
    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    img_urls = tree.xpath('//div[@class="smwdata"]/@data-icon')
    pool = Pool(10)
    pool.map(func, img_urls)
    pool.close()
    pool.join()

Single thread + asynchronous coroutines.

Using a task or a future:

import asyncio

async def func():
    pass

c = func()  # get a coroutine object
loop = asyncio.get_event_loop()  # create an event loop

# using a task (each alternative needs its own coroutine object; a coroutine can only be awaited once)
task = loop.create_task(c)     # wrap as a task object
loop.run_until_complete(task)  # register the task and start the event loop

# using a future
future = asyncio.ensure_future(c)  # wrap as a future object
loop.run_until_complete(future)    # register the future and start the event loop
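On Python 3.7+, asyncio.run() is the simpler entry point: it creates the event loop, runs the coroutine to completion, and closes the loop (an added note, not part of the original flow):

asyncio.run(func())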

Binding a callback:

def callback_func(future):
    print(future.result())  # print the task's return value

# using a future
future = asyncio.ensure_future(c)  # wrap as a future object
future.add_done_callback(callback_func)
loop.run_until_complete(future)    # register the future and start the event loop

import aiohttp
import aiofiles

async def func(img_url):  # calling an async def function returns a coroutine object
    name = unquote(img_url.split('/')[-1])
    async with aiohttp.ClientSession() as session:  # aiohttp is the asynchronous HTTP module
        # get()/post() accept headers/params/data and proxy='http://ip:port'
        async with await session.get(url=img_url, headers=headers) as response:  # await suspends the coroutine here
            img_data = await response.read()  # binary response data; use text() for str, json() for JSON
            # async with aiofiles.open(os.path.join('icon', name), 'wb') as fp:  # asynchronous file I/O
            #     await fp.write(img_data)
            with open(os.path.join('icon', name), 'wb') as fp:
                fp.write(img_data)


if __name__ == "__main__":
    url = 'http://prts.wiki/w/干员一览'
    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    img_urls = tree.xpath('//div[@class="smwdata"]/@data-icon')
    tasks = [asyncio.ensure_future(func(img_url)) for img_url in img_urls]  # wrap each coroutine as a task
    loop = asyncio.get_event_loop()  # create an event loop
    loop.run_until_complete(asyncio.wait(tasks))

selenium

selenium is a module for browser automation. It requires the driver program for the corresponding browser.

selenium makes it easy to obtain dynamically loaded data and to simulate logging in.

Browser initialization:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# bro = webdriver.Chrome(executable_path='./chromedriver')  # pass the path to the driver program
# driver = webdriver.Chrome()  # works if the driver is already in Python's Scripts directory (on PATH)
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
url = "http://scxk.nmpa.gov.cn:81/xk/"
driver.get(url)
page_text = driver.page_source
tree = etree.HTML(page_text)
name_list = tree.xpath('//*[@id="gzlist"]/li/dl/@title')
[print(name) for name in name_list]
driver.quit()

Startup options:

chrome_options = webdriver.ChromeOptions()  # same class as selenium.webdriver.chrome.options.Options
# developer mode, to make it harder to detect that selenium is being used
# chrome_options.add_experimental_option('excludeSwitches', ['enable-logging'])  # suppress log output
chrome_options.add_experimental_option('excludeSwitches', ['enable-automation'])  # only one excludeSwitches call takes effect
chrome_options.add_argument('--headless')         # headless mode
chrome_options.add_argument('--disable-gpu')      # together with --headless, keeps Chrome from opening a window
chrome_options.add_argument('--start-maximized')  # maximize the window
chrome_options.add_argument('--incognito')        # incognito mode
chrome_options.add_argument("disable-cache")      # disable the cache
chrome_options.add_argument('disable-infobars')
chrome_options.add_argument('log-level=3')        # INFO = 0, WARNING = 1, LOG_ERROR = 2, LOG_FATAL = 3; default is 0
browser = webdriver.Chrome(chrome_options=chrome_options)
url = "https://www.tmall.com/"
driver.get(url)
search_input = driver.find_element_by_id('mq')
search_input.send_keys('五年高考三年模拟')
btn = driver.find_element_by_xpath('//*[@id="mallSearch"]/form/fieldset/div/button') # elements会返回列表
driver.execute_script('window.scrollTo(0,document.body.scrollHeight)')
btn.click()
driver.back()
driver.forward()
driver.quit()

When a page embeds another page in an iframe, you must switch into that frame's scope before locating elements inside it.

url = "https://www.runoob.com/try/try-cdnjs.php?filename=jqueryui-api-droppable"
driver.get(url)
driver.switch_to.frame('iframeResult')
driver.maximize_window()
div = driver.find_element_by_id('draggable')

Action chains: calling ActionChains methods queues the operations in order, and calling perform() executes the queued events one after another.

from selenium.webdriver import ActionChains

action = ActionChains(driver)
# option 1: hold, move, and release step by step
action.click_and_hold(div)
action.move_by_offset(xoffset=250, yoffset=0)
action.release()
action.perform()  # execute the queued actions; the calls also support chaining
# option 2: one call does the whole drag
# ActionChains(driver).drag_and_drop_by_offset(div, xoffset=250, yoffset=0).perform()

from PIL import Image

driver.save_screenshot(path)  # screenshot of the whole page
location = div.location       # coordinates of the element's top-left corner
size = div.size
img = Image.open(path)
rangle = (location['x'], location['y'],
          location['x'] + size['width'], location['y'] + size['height'])
img = img.crop(rangle)        # crop out just the element
img.save(path)

scrapy framework

A framework is a project template that integrates many features and is highly reusable.

# create a scrapy project
scrapy startproject <name>
cd <name>
# create a spider file
scrapy genspider <spidername> <domain>
# run the spider
scrapy crawl <spidername>

settings.py configuration:

ROBOTSTXT_OBEY = False  # do not obey robots.txt
LOG_LEVEL = 'ERROR'     # only log errors
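For reference, a minimal spider might look like the sketch below (names such as first.py, FirstSpider and quotes.toscrape.com are placeholders, not from the original notes):

# spiders/first.py
import scrapy

class FirstSpider(scrapy.Spider):
    name = 'first'                                 # name used by `scrapy crawl first`
    start_urls = ['https://quotes.toscrape.com/']  # initial request URLs

    def parse(self, response):
        # response supports xpath()/css() selectors directly
        for quote in response.xpath('//div[@class="quote"]'):
            yield {'text': quote.xpath('./span[@class="text"]/text()').get()}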
