

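# Python Scraping with Proxies: A Key Skill for Getting Past IP Blocks

## 1. Why Use a Proxy?

When collecting web data, a high request rate or an anti-scraping mechanism on the target site can quickly lead to problems such as:

- HTTP 403 / 429 errors (access forbidden, or too many requests)
- Pages that fail to load or redirect to a CAPTCHA page
- A short-term or permanent ban on your IP

This is where a **proxy IP** helps: by changing the outbound IP, your requests appear to come from different clients, which sidesteps per-IP limits.

---

## 2. Types of Proxies

Common proxy types include:

- **HTTP proxies**: suitable for ordinary web requests
- **HTTPS proxies**: support SSL-encrypted requests
- **SOCKS proxies**: operate at a lower level and support a wider range of protocols (these need third-party support, for example installing `requests[socks]`)

---

## 3. Usage: `requests` as an Example

### Install dependencies

The first example only needs `requests`; `beautifulsoup4` is used by the parsing example in section 4.

```bash
pip install requests beautifulsoup4
```

### Example: requesting a page through a proxy

```python
import requests

# Proxy settings: this can be a free public proxy or a paid one.
# The address below is a placeholder; replace it with a proxy you can actually use.
proxies = {
    'http': 'http://123.56.78.90:8080',
    'https': 'http://123.56.78.90:8080'
}

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
}

try:
    response = requests.get('https://httpbin.org/ip', headers=headers, proxies=proxies, timeout=5)
    response.raise_for_status()  # treat 4xx/5xx responses as failures too
    print('Request through proxy succeeded:', response.json())
except requests.exceptions.RequestException as e:
    print('Request through proxy failed:', e)
```

> Note: you can use [httpbin.org/ip](https://httpbin.org/ip) to check which IP your request is actually going out from.

---

## 4. Scraping a Target Site (with a Simple Proxy Pool)

The following example scrapes the Douban Top 250 list using a simple proxy-pool strategy: each attempt picks a random proxy from a list.

```python
import random
import time

import requests
from bs4 import BeautifulSoup

# Placeholder addresses; replace them with working proxies.
proxy_list = [
    'http://123.56.78.90:8080',
    'http://45.90.12.34:3128',
    'http://180.76.154.5:8888'
]

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
}

def fetch_page(url):
    for _ in range(5):  # try at most 5 times
        # Use one proxy for both schemes, so the proxy that is logged
        # is the one that actually carries the request.
        proxy_url = random.choice(proxy_list)
        proxy = {'http': proxy_url, 'https': proxy_url}
        try:
            print(f'Using proxy: {proxy_url}')
            response = requests.get(url, headers=headers, proxies=proxy, timeout=5)
            if response.status_code == 200:
                return response.text
            print(f'Got HTTP {response.status_code}, switching proxy and retrying...')
        except requests.exceptions.RequestException:
            print('Request failed, switching proxy and retrying...')
        time.sleep(1)  # brief pause before the next attempt
    return None

def parse_and_print(html):
    soup = BeautifulSoup(html, 'html.parser')
    for item in soup.select('.item'):
        title = item.select_one('.title').text.strip()
        rating = item.select_one('.rating_num').text.strip()
        print(f'{title} - rating: {rating}')

if __name__ == '__main__':
    url = 'https://movie.douban.com/top250'
    html = fetch_page(url)
    if html:
        parse_and_print(html)
    else:
        print('Scrape failed; all proxies may be dead.')
```

---

## 5. Where to Get Proxies

- **Proxy websites** (stable and easy to use): https://www.okkk.tech/

You can also pair your scraper with a "proxy validator" that filters out the working proxies and automatically stores them in a local pool.

---

## 6. Conclusion

Proxies are one of the core techniques for getting around anti-scraping measures, but they are not a cure-all. They work best when combined with other techniques, such as throttling the interval between requests, sending browser-like headers, and handling CAPTCHAs.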
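
The "proxy validator" mentioned in section 5 is not spelled out in the article, so here is a minimal sketch of one possible shape. It relies on several assumptions: the function names `check_proxy`, `filter_working`, and `save_pool` and the file name `proxy_pool.json` are hypothetical, a proxy counts as "working" if it can fetch `https://httpbin.org/ip` with HTTP 200 within the timeout, and the check uses only the standard library (`urllib.request`) so that it runs without extra dependencies. The same check could be written with `requests` and its `proxies` argument.

```python
import http.client
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumption: the same IP echo service the article uses for manual checks.
TEST_URL = 'https://httpbin.org/ip'

def check_proxy(proxy_url, timeout=5):
    """Return True if TEST_URL can be fetched through proxy_url."""
    handler = urllib.request.ProxyHandler({'http': proxy_url, 'https': proxy_url})
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(TEST_URL, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError, timeouts and connection errors are OSError subclasses;
        # HTTPException covers malformed replies from a misbehaving proxy.
        return False

def filter_working(candidates, check=check_proxy, max_workers=8):
    """Check candidates concurrently and return the ones that pass, in order."""
    candidates = list(candidates)
    if not candidates:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(check, candidates))
    return [proxy for proxy, ok in zip(candidates, results) if ok]

def save_pool(proxies, path='proxy_pool.json'):
    """Write the list of usable proxies to a local JSON file."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(proxies, f, indent=2)
```

Under these assumptions, calling `save_pool(filter_working(proxy_list))` would test every address in a candidate list and keep only the ones that responded. The `check` parameter is injectable so the filtering logic can be exercised without network access, and so a stricter check (for example, one that also fetches the real target site) can be swapped in later.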
