Python CAPTCHA Recognition (Using the pytesseract Library)

tulisx0

Published 2025-06-26 07:45:00

License: CC 4.0 BY-SA

Tags: python, programming languages

Article link: https://blog.csdn.net/tulisx0/article/details/148872157

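CAPTCHA recognition is a common requirement in automated testing and web scraping. pytesseract is a third-party Python wrapper around the Tesseract OCR engine that can recognize text in images. The sections below walk through using pytesseract to recognize CAPTCHAs.

Installing dependencies

Make sure the Tesseract OCR engine is installed on your system. On Debian- or Ubuntu-based Linux systems, it can be installed with: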

sudo apt install tesseract-ocr

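On Windows, download an installer from the Tesseract project's official GitHub page.

Install the Python dependencies (opencv-python provides the cv2 module used in the preprocessing examples later on):

pip install pytesseract pillow opencv-python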
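
Since pytesseract only wraps the Tesseract command-line program, it can help to confirm that the executable is reachable before running any OCR code. The check below is a minimal sketch using only the standard library; the helper name `find_tesseract` is made up for this illustration and is not part of pytesseract.

```python
import shutil
from typing import Optional


def find_tesseract(executable: str = "tesseract") -> Optional[str]:
    """Return the full path of the Tesseract binary, or None if it is not on PATH.

    Hypothetical helper for illustration; not part of pytesseract.
    """
    return shutil.which(executable)


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract was not found on PATH")
    else:
        print("tesseract found at:", path)
```

If the binary is installed somewhere outside PATH (common on Windows), pytesseract can be pointed at it by assigning the full path to `pytesseract.pytesseract.tesseract_cmd`.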

Basic CAPTCHA recognition

Assuming the CAPTCHA image is captcha.png, the following code recognizes it:

import pytesseract
from PIL import Image

# Load the image
image = Image.open('captcha.png')

# Recognize the text with pytesseract
text = pytesseract.image_to_string(image)

print("Recognition result:", text)
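
The string returned by image_to_string usually carries trailing whitespace (a newline, and with some Tesseract versions a form-feed character), and OCR may also insert spaces or stray symbols between characters. A small cleanup step, sketched here under the assumption that the CAPTCHA contains only ASCII letters and digits, makes the result easier to compare or submit. The name `clean_captcha_text` is hypothetical.

```python
import string

# Characters expected in the CAPTCHA; this set is an assumption for the example.
DEFAULT_ALLOWED = string.ascii_letters + string.digits


def clean_captcha_text(raw: str, allowed: str = DEFAULT_ALLOWED) -> str:
    """Drop every character not in `allowed` (spaces, newlines, stray symbols)."""
    return "".join(ch for ch in raw if ch in allowed)


if __name__ == "__main__":
    # Simulated OCR output with a space, a stray symbol, and trailing whitespace.
    print(clean_captcha_text("a B|3x\n\x0c"))
```

In practice the function would be applied to the OCR result, for example `clean_captcha_text(text)`.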

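Preprocessing the image to improve accuracy

CAPTCHAs usually contain noise or interference lines, so recognizing the raw image often gives poor results. The following preprocessing steps can improve accuracy:

import cv2
import pytesseract
from PIL import Image

# Read the image and convert it to grayscale
image = cv2.imread('captcha.png')
if image is None:  # cv2.imread returns None instead of raising on failure
    raise FileNotFoundError('captcha.png could not be read')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Binarize: pixels above 150 become white (255), the rest black (0)
_, thresh = cv2.threshold(gray, 150, 255, cv2.THRESH_BINARY)

# Save the preprocessed image
cv2.imwrite('processed_captcha.png', thresh)

# Recognize the preprocessed image
text = pytesseract.image_to_string(Image.open('processed_captcha.png'))
print("Recognition result:", text)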
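
To make the binarization step concrete, here is a pure-Python sketch of the rule that cv2.THRESH_BINARY applies: a pixel strictly greater than the threshold becomes the maximum value, and everything else becomes 0. The `binarize` function and the tiny sample grid are illustrative only; real code would keep using cv2.threshold on the NumPy array.

```python
from typing import List

Gray = List[List[int]]  # rows of 0-255 grayscale values


def binarize(pixels: Gray, thresh: int = 150, maxval: int = 255) -> Gray:
    """Map each pixel to maxval if it is strictly greater than thresh, else to 0."""
    return [[maxval if p > thresh else 0 for p in row] for row in pixels]


if __name__ == "__main__":
    sample = [
        [30, 149, 150],
        [151, 200, 255],
    ]
    for row in binarize(sample):
        print(row)
```

A fixed threshold of 150 will not suit every CAPTCHA. When lighting or contrast varies, OpenCV can choose the threshold automatically with Otsu's method by passing `cv2.THRESH_BINARY | cv2.THRESH_OTSU` together with a threshold argument of 0.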

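Tuning recognition parameters

pytesseract passes Tesseract options through the config argument, which can be used to change the recognition mode:

# Digits only: "outputbase digits" loads Tesseract's digits config; --psm 6 assumes a single uniform block of text
digits_only = pytesseract.image_to_string(image, config='--psm 6 outputbase digits')

# --psm 10 treats the image as a single character; --oem 3 selects the default OCR engine mode
text = pytesseract.image_to_string(image, config='--psm 10 --oem 3')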
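
Config strings are easy to mistype, so one option is to assemble them with a small helper. The sketch below is an assumption-laden example rather than a pytesseract feature: `build_tesseract_config` is a hypothetical name, the default of `--psm 7` (treat the image as a single text line) is a guess at what suits a one-line CAPTCHA, and the whitelist is passed through Tesseract's `tessedit_char_whitelist` variable, whose effect can depend on the Tesseract version and engine mode.

```python
from typing import Optional


def build_tesseract_config(psm: int = 7, oem: int = 3,
                           whitelist: Optional[str] = None) -> str:
    """Build a Tesseract config string from a page segmentation mode,
    an OCR engine mode, and an optional character whitelist."""
    parts = [f"--oem {oem}", f"--psm {psm}"]
    if whitelist:
        # Restrict recognition to the given characters.
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


if __name__ == "__main__":
    print(build_tesseract_config())
    print(build_tesseract_config(psm=8, whitelist="0123456789"))
```

The result would then be passed along as `pytesseract.image_to_string(image, config=build_tesseract_config(whitelist="0123456789"))`.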

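Handling complex CAPTCHAs

For CAPTCHAs that are distorted or have busy backgrounds, more advanced image processing may be needed. The snippet below reuses gray from the preprocessing example:

# Remove noise with a 3x3 median filter
denoised = cv2.medianBlur(gray, 3)

# Emphasize character outlines with Canny edge detection
edges = cv2.Canny(denoised, 50, 150)

# Recognize the processed image
text = pytesseract.image_to_string(Image.fromarray(edges))
print("Recognition result:", text)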
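
To show why median filtering removes speckle noise, the following pure-Python sketch replaces each pixel with the median of its 3x3 neighborhood. It is an illustration of the idea only, not OpenCV's implementation: `median_filter_3x3` is a hypothetical name, and edge pixels are handled here by repeating the nearest border pixel.

```python
from typing import List

Gray = List[List[int]]  # rows of 0-255 grayscale values


def median_filter_3x3(pixels: Gray) -> Gray:
    """Replace each pixel with the median of its 3x3 neighborhood."""
    height, width = len(pixels), len(pixels[0])

    def at(y: int, x: int) -> int:
        # Clamp coordinates so that border pixels are repeated at the edges.
        y = min(max(y, 0), height - 1)
        x = min(max(x, 0), width - 1)
        return pixels[y][x]

    result = []
    for y in range(height):
        row = []
        for x in range(width):
            window = sorted(at(y + dy, x + dx)
                            for dy in (-1, 0, 1) for dx in (-1, 0, 1))
            row.append(window[4])  # middle of the 9 sorted values
        result.append(row)
    return result


if __name__ == "__main__":
    # A white patch with a single black noise pixel in the center.
    noisy = [
        [255, 255, 255],
        [255, 0, 255],
        [255, 255, 255],
    ]
    for row in median_filter_3x3(noisy):
        print(row)
```

An isolated dark pixel is outvoted by its eight bright neighbors, which is why the filter suits salt-and-pepper noise while leaving larger strokes mostly intact.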

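Batch CAPTCHA recognition

To process many CAPTCHA files at once, the following code can be used:

import os

import pytesseract
from PIL import Image

# Iterate over the CAPTCHA files in the directory
for filename in os.listdir('captchas'):
    if filename.endswith('.png'):
        image_path = os.path.join('captchas', filename)
        text = pytesseract.image_to_string(Image.open(image_path))
        print(f"{filename}: {text.strip()}")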
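
A variant of the batch loop that returns its results instead of printing them can be easier to reuse and to test. The sketch below is one possible shape, with assumptions stated up front: `recognize_directory` is a hypothetical helper, the OCR step is passed in as a callable so the function can be exercised without Tesseract installed, files are visited in sorted order, and the extension match ignores case.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict


def recognize_directory(directory: str,
                        recognize: Callable[[Path], str],
                        suffix: str = ".png") -> Dict[str, str]:
    """Run `recognize` on every file in `directory` whose extension matches
    `suffix` (case-insensitive) and return {filename: stripped text}."""
    results: Dict[str, str] = {}
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and path.suffix.lower() == suffix:
            results[path.name] = recognize(path).strip()
    return results


if __name__ == "__main__":
    # Demo with a stand-in recognizer that just echoes the file stem.
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("b.png", "a.PNG", "notes.txt"):
            (Path(tmp) / name).touch()
        print(recognize_directory(tmp, lambda p: p.stem + "\n"))
```

With pytesseract and Pillow available, the real recognizer could be supplied as `lambda p: pytesseract.image_to_string(Image.open(p))`.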