Last updated: the evening of October 4, 2022

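selenium is a package that drives a browser for automated control, with support for Java, Python, C#, Ruby, JavaScript, and Kotlin. It can simulate clicks, typing, and other actions a human would perform. This post uses Python to simulate clicking through a captcha as an example of how to use selenium.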

This post is not original work; it is the author's study notes.

hcaptcha

The HCaptcha demo site is at the URL below. Opening it shows the captcha entry page:
https://democaptcha.com/demo-form-eng/hcaptcha.html

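When the checkbox is clicked, the captcha first runs its risk-analysis engine to assess the current user. A low-risk user passes straight through; otherwise the captcha pops up a dialog and asks us to answer the question in it. This is actually somewhat simpler than ReCaptcha: the image grid is always 3x3, never 4x4, and clicking an image does not bring up a new small image for a second round of selection, so the approach to solving it is simpler too.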

Project repository

https://github.com/Python3WebSpider/HCaptchaResolver

Wrapping the solver

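import re
import time
from random import random

import requests
from loguru import logger  # assumed logging library; the article never shows this import
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.remote.webelement import WebElement
from selenium.webdriver.common.action_chains import ActionChains
from app.captcha_resolver import CaptchaResolver
from app.settings import CAPTCHA_SINGLE_IMAGE_FILE_PATH  # assumed to live next to CAPTCHA_RESIZED_IMAGE_FILE_PATH
from app.utils import resize_base64_image  # assumed module path for the helper defined later in this post


class Solution(object):
    def __init__(self, url):
        # Launch Chrome and open the page that hosts the captcha.
        self.browser = webdriver.Chrome()
        self.browser.get(url)
        # Explicit waits give up after 10 seconds.
        self.wait = WebDriverWait(self.browser, 10)
        self.captcha_resolver = CaptchaResolver()

    def __del__(self):
        # Pause for 10 seconds before closing the browser window.
        time.sleep(10)
        self.browser.close()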

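iframe switching support

Like ReCaptcha, this captcha is loaded inside an iframe, and the challenge images that pop up live in yet another iframe, so we need to switch between iframes.

The following methods switch to the iframe for the entry checkbox and to the iframe for the challenge itself: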

    def get_captcha_entry_iframe(self) -> WebElement:
        # Always search from the top-level document.
        self.browser.switch_to.default_content()
        # find_element_by_* was removed in Selenium 4.3; use find_element(By, ...) instead.
        captcha_entry_iframe = self.browser.find_element(
            By.CSS_SELECTOR, '.h-captcha > iframe')
        return captcha_entry_iframe

    def switch_to_captcha_entry_iframe(self) -> None:
        captcha_entry_iframe: WebElement = self.get_captcha_entry_iframe()
        self.browser.switch_to.frame(captcha_entry_iframe)

    def get_captcha_content_iframe(self) -> WebElement:
        self.browser.switch_to.default_content()
        captcha_content_iframe = self.browser.find_element(
            By.XPATH, '//iframe[contains(@title, "Main content")]')
        return captcha_content_iframe

    def switch_to_captcha_content_iframe(self) -> None:
        captcha_content_iframe: WebElement = self.get_captcha_content_iframe()
        self.browser.switch_to.frame(captcha_content_iframe)
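
As an alternative to pairing every switch by hand, the switching could be wrapped in a context manager that always returns to the top-level document. This is only a sketch: `inside_frame` is a hypothetical helper that is not part of the original project, and it relies solely on the `switch_to.frame` and `switch_to.default_content` calls already used above.

```python
from contextlib import contextmanager


@contextmanager
def inside_frame(browser, frame):
    """Switch into `frame`, then always return to the top-level document.

    `inside_frame` is a hypothetical helper, not part of the original project.
    """
    browser.switch_to.default_content()  # start from a known context
    browser.switch_to.frame(frame)
    try:
        yield browser
    finally:
        # Runs even if the body raises, so later lookups start from the top.
        browser.switch_to.default_content()
```

Inside the class it might be used as `with inside_frame(self.browser, self.get_captcha_entry_iframe()): ...`. It fits lookups that are finished within one iframe; code that has to stay inside an iframe across several method calls still needs the explicit switch methods.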

Triggering the captcha

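    def trigger_captcha(self) -> None:
        # Click the checkbox inside the entry iframe.
        self.switch_to_captcha_entry_iframe()
        captcha_entry = self.wait.until(EC.presence_of_element_located(
            (By.CSS_SELECTOR, '#anchor #checkbox')))
        captcha_entry.click()
        time.sleep(2)
        # The challenge itself is rendered in a second iframe.
        self.switch_to_captcha_content_iframe()
        # get_captcha_element is not shown in this post.
        captcha_element: WebElement = self.get_captcha_element()
        # is_displayed is a method and must be called; the bare attribute is always truthy.
        if captcha_element.is_displayed():
            logger.debug('triggered captcha successfully')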

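We first call switch_to_captcha_entry_iframe to switch iframes, then locate the node for the entry checkbox and click it.

After clicking, we call switch_to_captcha_content_iframe to switch into the iframe that holds the challenge itself and check whether the challenge node has loaded. If it has, the captcha was triggered successfully.

How do we find the question? Selenium's regular node lookup is enough: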

    def get_captcha_target_text(self) -> str:
        # The question is the text of the .prompt-text node, so the return type is str.
        captcha_target_name_element: WebElement = self.wait.until(EC.presence_of_element_located(
            (By.CSS_SELECTOR, '.prompt-text')))
        return captcha_target_name_element.text

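Calling this method gives us the full question text.

Captcha recognition

Each image has to be downloaded and converted to Base64. Let's look at the HTML structure first.

Each captcha image corresponds to a .task-image node, which contains an .image-wrapper node, which in turn contains an .image node. So how is the picture rendered? The .image node carries a style attribute, and the image address is set through the CSS background property.

So extracting the captcha images is fairly easy: we just read the style attribute of each .image node and pull the url out of it.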
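
The extraction step can be tried on its own, without a browser. The sketch below applies the same kind of regular expression to a style string; the sample style value is made up for illustration, and `extract_background_url` is a hypothetical helper name.

```python
import re
from typing import Optional

# Matches url("https://...") inside an inline style string.
BACKGROUND_URL_PATTERN = re.compile(r'url\("(https.*?)"\)')


def extract_background_url(style: str) -> Optional[str]:
    """Return the first https URL found in a CSS style string, or None."""
    match = BACKGROUND_URL_PATTERN.search(style)
    return match.group(1) if match else None


# Hypothetical style value, made up for illustration.
sample_style = (
    'width: 100px; '
    'background: url("https://example.com/img/1.png") 50% 50% / cover;'
)
print(extract_background_url(sample_style))     # https://example.com/img/1.png
print(extract_background_url('width: 100px;'))  # None
```

Returning None when nothing matches makes the missing-image case explicit, so the caller can decide whether to skip the tile or fail before attempting a download.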

Once we have the URL, we download the image, convert it to Base64, and hand the content to captcha_resolver for recognition.

So the code can be written as follows:

    def verify_captcha(self):
        # get target text
        self.captcha_target_text = self.get_captcha_target_text()
        logger.debug(
            f'captcha_target_text {self.captcha_target_text}'
        )
        # extract all images
        single_captcha_elements = self.wait.until(EC.visibility_of_all_elements_located(
            (By.CSS_SELECTOR, '.task-image .image-wrapper .image')))
        resized_single_captcha_base64_strings = []
        # raw string, so the backslashes reach the regex engine unchanged
        pattern = re.compile(r'url\("(https.*?)"\)')
        for i, single_captcha_element in enumerate(single_captcha_elements):
            # the image address sits in the inline style: background: url("...")
            single_captcha_element_style = single_captcha_element.get_attribute(
                'style')
            match_result = re.search(pattern, single_captcha_element_style)
            single_captcha_element_url = match_result.group(
                1) if match_result else None
            logger.debug(
                f'single_captcha_element_url {single_captcha_element_url}')
            # download the tile, then resize it and encode it as Base64
            with open(CAPTCHA_SINGLE_IMAGE_FILE_PATH % (i,), 'wb') as f:
                f.write(requests.get(single_captcha_element_url).content)
            resized_single_captcha_base64_string = resize_base64_image(
                CAPTCHA_SINGLE_IMAGE_FILE_PATH % (i,), (100, 100))
            resized_single_captcha_base64_strings.append(
                resized_single_captcha_base64_string)

        logger.debug(
            f'length of single_captcha_element_urls {len(resized_single_captcha_base64_strings)}')

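A regular expression extracts the url of each captcha image. After extracting it, we download the image, resize and Base64-encode it, and store the result in the resized_single_captcha_base64_strings list.

The Base64 encoding lives in a separate function that takes the image path and the target size and returns the encoded result. It is defined as follows: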

from PIL import Image
import base64
from app.settings import CAPTCHA_RESIZED_IMAGE_FILE_PATH


def resize_base64_image(filename, size):
    width, height = size
    img = Image.open(filename)
    new_img = img.resize((width, height))
    new_img.save(CAPTCHA_RESIZED_IMAGE_FILE_PATH)
    with open(CAPTCHA_RESIZED_IMAGE_FILE_PATH, "rb") as f:
        data = f.read()
        encoded_string = base64.b64encode(data)
        return encoded_string.decode('utf-8')
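
The last two lines of the function matter because `base64.b64encode` returns bytes, while a text payload needs a str. A minimal standard-library sketch of just that step follows; `to_base64_text` is a hypothetical name, and the input is the 8-byte PNG signature rather than a real captcha image.

```python
import base64


def to_base64_text(data: bytes) -> str:
    """Encode raw bytes as Base64 and return the result as text."""
    encoded_bytes = base64.b64encode(data)  # still bytes at this point
    return encoded_bytes.decode('utf-8')    # Base64 output is plain ASCII


# The 8-byte signature that every PNG file starts with.
png_signature = b'\x89PNG\r\n\x1a\n'
text = to_base64_text(png_signature)
print(text)  # iVBORw0KGgo=

# Decoding restores the original bytes exactly.
print(base64.b64decode(text) == png_signature)  # True
```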

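Image recognition

Omitted.

Simulating clicks

Once we have the list of true/false results, we only need to pick out the indices whose result is true and click the corresponding captcha tiles. The code is as follows: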

# click captchas
recognized_indices = [i for i, x in enumerate(recognized_results) if x]
logger.debug(f'recognized_indices {recognized_indices}')
click_targets = self.wait.until(EC.visibility_of_all_elements_located(
    (By.CSS_SELECTOR, '.task-image')))
for recognized_index in recognized_indices:
    click_target: WebElement = click_targets[recognized_index]
    click_target.click()
    time.sleep(random())

Here a list comprehension turns the true/false list into a list of indices: each element is the position of a true in the original list, which is exactly a click target.
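
The conversion is easy to check in isolation. In the sketch below the recognition result is made up, and the row-by-row ordering of the 3x3 grid is an assumption about how the tiles appear in the DOM, not something the post confirms.

```python
from typing import List


def indices_of_true(results: List[bool]) -> List[int]:
    """Return the positions at which `results` holds a truthy value."""
    return [i for i, hit in enumerate(results) if hit]


# Hypothetical recognition result for nine tiles.
recognized_results = [False, True, False,
                      True, False, False,
                      False, False, True]

indices = indices_of_true(recognized_results)
print(indices)  # [1, 3, 8]

# Assuming the tiles are listed row by row, divmod gives (row, column).
print([divmod(i, 3) for i in indices])  # [(0, 1), (1, 0), (2, 2)]
```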

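Next we fetch the nodes for all the captcha tiles and call click on each selected one in turn.

This lets us handle the captcha tiles one by one.

Clicking verify

With the logic above, we can complete the whole HCaptcha recognition and selection.

Finally, we simulate a click on the verify button: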

# after all captcha clicked
verify_button: WebElement = self.get_verify_button()
# is_displayed is a method and must be called; the bare attribute is always truthy
if verify_button.is_displayed():
    verify_button.click()
    time.sleep(3)

The verify_button is also located with Selenium:

def get_verify_button(self) -> WebElement:
    verify_button = self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.button-submit')))
    return verify_button

Checking the result

Successful verification is indicated by a small green check mark.

def get_is_successful(self):
    self.switch_to_captcha_entry_iframe()
    anchor: WebElement = self.wait.until(EC.visibility_of_element_located((
        By.CSS_SELECTOR, '#anchor #checkbox'
    )))
    checked = anchor.get_attribute('aria-checked')
    logger.debug(f'checked {checked}')
    return str(checked) == 'true'

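Here we first switch iframes, then check whether the checkbox's aria-checked attribute has the expected value.

If get_is_successful returns True, recognition succeeded and the whole process is done.

If it returns False, we can call the logic above recursively for another round, until recognition succeeds.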

# check if succeed
is_succeed = self.get_is_successful()
if is_succeed:
    logger.debug('verified successfully')
else:
    # not solved yet: run another round
    self.verify_captcha()
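
Recursion keeps trying for as long as the check fails, so if recognition never succeeds it eventually hits Python's recursion limit. A bounded loop is one possible alternative. The sketch below is a swapped-in technique, not the original project's method: `solve_with_retries` and its parameters are hypothetical, and it assumes one round of solving and the success check are available as two separate callables.

```python
from typing import Callable


def solve_with_retries(attempt: Callable[[], None],
                       is_successful: Callable[[], bool],
                       max_attempts: int = 5) -> bool:
    """Run `attempt` until `is_successful` returns True or attempts run out.

    Hypothetical helper: `attempt` performs one round of recognition and
    clicking, `is_successful` reports whether the captcha was accepted.
    """
    for _ in range(max_attempts):
        attempt()
        if is_successful():
            return True
    return False  # give up instead of retrying forever
```

This would fit if verify_captcha performed a single round without calling itself, for example `solve_with_retries(self.verify_captcha, self.get_is_successful)`. The cap of five attempts is arbitrary.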

ReCaptcha

To be continued.

Simulating clicks on hcaptcha and recaptcha captchas with selenium

https://pawswrite.xyz/posts/13934.html

Author

Rainbow

Published on

August 12, 2022
