Crawler Reverse Engineering: Font Anti-Scraping (Part 2), GlidedSky (镀金的天空) Font Anti-Scraping 2


Striking while the iron is hot, here is the second post on font anti-scraping. First, the problem:

image.png

The page does not display ordinary digits, and the page source contains Chinese characters instead.

image.png

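It looks messy, but a closer look reveals some patterns. For example, 长 corresponds to 2 and 思 corresponds to 1.

So the approach here is again to find the mapping table for these characters first.

As in the first problem, save the base64 data embedded in the page as a local ttf file, then open it and take a look: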

image.png

Unlike the first problem, the glyph names are no longer English words. They look like Unicode code points, such as uni601D. Let's try decoding one:

In [10]: a
Out[10]: '\\u601D'

In [11]: a.encode("latin-1").decode("unicode_escape")
Out[11]: '思'

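unicode_escape is a Python codec. Unlike utf-8, it does not store characters as raw bytes: it represents text the way a Python string literal would, writing non-ASCII characters as \uXXXX escapes. Decoding with it turns the literal text \u601D into the character that escape names.

Reference: https://blog.csdn.net/qq_40728667/article/details/122282693

Byte ranges of the encodings involved: ASCII uses byte values 0-127 and Latin-1 uses 0-255, with each byte mapping to the code point of the same value, so encoding the ASCII-only escape text with latin-1 changes nothing. In UTF-8, only code points 0-127 take a single byte, and the byte values 254 and 255 never occur, which is where the 0-253 figure comes from.

Reference: https://blog.csdn.net/liuhhaiffeng/article/details/80162033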
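As a minimal sketch of the decoding step above, the snippet below puts the unicode_escape round trip next to a direct chr(int(..., 16)) conversion, which reaches the same character without going through bytes. The helper name glyph_name_to_char is my own and not part of the original code, and it assumes glyph names have the form uni followed by four hex digits.

```python
def glyph_name_to_char(name: str) -> str:
    """Convert a glyph name such as 'uni601D' to the character it names.

    Hypothetical helper: assumes the name is 'uni' + 4 hex digits.
    """
    return chr(int(name[-4:], 16))


# The round trip used in the article: build the literal text \u601D,
# turn it into bytes with latin-1, then let unicode_escape interpret it.
escaped = "\\u" + "uni601D"[-4:]
via_escape = escaped.encode("latin-1").decode("unicode_escape")

# The direct route: parse the hex digits and ask for that code point.
direct = glyph_name_to_char("uni601D")

print(via_escape, direct, via_escape == direct)
```

The direct form would also work for code points above U+FFFF if a font used longer names, whereas the \u escape is fixed at four hex digits.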

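This matches the characters in the page source. So we only need to take the first ten Unicode glyph names, decode them into their characters, and the order of the glyphs then tells us which digit each character stands for.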

from fontTools.ttLib import TTFont


def parse_ttf():
    """Build the mapping from Chinese characters to digits."""
    font = TTFont("page-1.ttf")
    # Glyph names in font order
    name_list = font.getGlyphOrder()
    # Skip the first glyph; the next ten names look like "uni601D"
    unicode_list = ["\\u" + c[-4:] for c in name_list[1:11]]
    cha2num = {}
    for index, u in enumerate(unicode_list):
        c = u.encode("latin-1").decode("unicode_escape")
        cha2num[c] = str(index)
    return cha2num
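To check the indexing logic without a font file, the same idea can be sketched as a pure function over a list of glyph names. The function name build_mapping and the sample glyph order below are made up for illustration and are not taken from the real font; the sketch also assumes the first entry is a placeholder glyph such as .notdef.

```python
def build_mapping(glyph_order):
    """Map each character to a digit based on its glyph position.

    Skips the first entry, then numbers the next ten glyphs from 0 to 9.
    """
    mapping = {}
    for index, name in enumerate(glyph_order[1:11]):
        # 'uni601D' -> 0x601D -> '思'
        mapping[chr(int(name[-4:], 16))] = str(index)
    return mapping


# Hypothetical glyph order, NOT read from the actual page font.
sample_order = [".notdef", "uni4E00", "uni601D", "uni957F"]
print(build_mapping(sample_order))
```

With a real font, the list passed in would be whatever getGlyphOrder() returns for the ttf decoded from the page.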

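Same character, different Unicode code point?

The characters extracted from the ttf sometimes fail to match the characters on the page.

For example, the page may contain ⿐ while the ttf has 鼻. The two look alike, but their Unicode code points differ: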

In [1]: '鼻'.encode('unicode_escape')
Out[1]: b'\\u9f3b'

In [2]: '⿐'.encode('unicode_escape')
Out[2]: b'\\u2fd0'

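This phenomenon has been written up elsewhere.

I searched for a long time without finding the exact cause. My guess is that the page converts the characters incorrectly somewhere during automatic UTF-8 decoding. What can be said for certain is that the lookalikes are distinct code points: Unicode encodes radicals in the Kangxi Radicals (U+2F00-U+2FDF) and CJK Radicals Supplement (U+2E80-U+2EFF) blocks, separately from the ordinary ideographs they resemble.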
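One quick way to see that the two lookalikes really are different characters is to ask Python's unicodedata module for their official names. This is only an inspection aid I am adding here, not part of the original solution; the helper name describe is my own.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# The ordinary ideograph and its radical lookalike from the session above.
print(describe("\u9f3b"))
print(describe("\u2fd0"))
```

The first is reported as a CJK unified ideograph and the second as a Kangxi radical, which is why a dictionary keyed on one does not match the other.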

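To deal with this, I previously compiled a mapping for most of these unusual characters, which covers essentially every case that shows up in this problem.

Reference: "Unicode基本汉字、部首扩展、康熙部首对照字典" (a lookup dictionary of basic CJK ideographs, the CJK Radicals Supplement, and Kangxi Radicals), 银古 | cxs的博客, CSDN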

{"⼀": "一","⼄": "乙","⼆": "二","⼈": "人","⼉": "儿","⼊": "入","⼋": "八","⼏": "几","⼑": "刀","⼒": "力","⼔": "匕","⼗": "十","⼘": "卜","⼚": "厂","⼜": "又","⼝": "口","⼞": "口","⼟": "土","⼠": "士","⼤": "大","⼥": "女","⼦": "子","⼨": "寸","⼩": "小","⼫": "尸","⼭": "山","⼯": "工","⼰": "己","⼲": "干","⼴": "广","⼸": "弓","⼼": "心","⼽": "戈","⼿": "手","⽀": "支","⽂": "文","⽃": "斗","⽄": "斤","⽅": "方","⽆": "无","⽇": "日","⽈": "曰","⽉": "月","⽊": "木","⽋": "欠","⽌": "止","⽍": "歹","⽏": "毋","⽐": "比","⽑": "毛","⽒": "氏","⽓": "气","⽔": "水","⽕": "火","⽖": "爪","⽗": "父","⽚": "片","⽛": "牙","⽜": "牛","⽝": "犬","⽞": "玄","⽟": "玉","⽠": "瓜","⽡": "瓦","⽢": "甘","⽣": "生","⽤": "用","⽥": "田","⽩": "白","⽪": "皮","⽫": "皿","⽬": "目","⽭": "矛","⽮": "矢","⽯": "石","⽰": "示","⽲": "禾","⽳": "穴","⽴": "立","⽵": "竹","⽶": "米","⽸": "缶","⽹": "网","⽺": "羊","⽻": "羽","⽼": "老","⽽": "而","⽿": "耳","⾁": "肉","⾂": "臣","⾃": "自","⾄": "至","⾆": "舌","⾈": "舟","⾉": "艮","⾊": "色","⾍": "虫","⾎": "血","⾏": "行","⾐": "衣","⾒": "儿","⾓": "角","⾔": "言","⾕": "谷","⾖": "豆","⾚": "赤","⾛": "走","⾜": "足","⾝": "身","⾞": "车","⾟": "辛","⾠": "辰","⾢": "邑","⾣": "酉","⾤": "采","⾥": "里","⾦": "金","⾧": "长","⾨": "门","⾩": "阜","⾪": "隶","⾬": "雨","⾭": "青","⾮": "非","⾯": "面","⾰": "革","⾲": "韭","⾳": "音","⾴": "页","⾵": "风","⾶": "飞","⾷": "食","⾸": "首","⾹": "香","⾺": "马","⾻": "骨","⾼": "高","⿁": "鬼","⿂": "鱼","⿃": "鸟","⿄": "卤","⿅": "鹿","⿇": "麻","⿉": "黍","⿊": "黑","⿍": "鼎","⿎": "鼓","⿏": "鼠","⿐": "鼻","⿒": "齿","⿓": "龙","⼣": "夕","⺁":"厂","⺇":"几","⺌":"小","⺎":"兀","⺏":"尣","⺐":"尢","⺑":"𡯂","⺒":"巳","⺓":"幺","⺛":"旡","⺝":"月","⺟":"母","⺠":"民","⺱":"冈","⺸":"芈","⻁":"虎","⻄":"西","⻅":"见","⻆":"角","⻇":"𧢲","⻉":"贝","⻋":"车","⻒":"镸","⻓":"长","⻔":"门","⻗":"雨","⻘":"青","⻙":"韦","⻚":"页","⻛":"风","⻜":"飞","⻝":"食","⻡":"𩠐","⻢":"马","⻣":"骨","⻤":"鬼","⻥":"鱼","⻦":"鸟","⻧":"卤","⻨":"麦","⻩":"黄","⻬":"齐","⻮":"齿","⻯":"竜","⻰":"龙","⻳":"龟","⾅":"臼","⼝":"口","⼾":"户","⼉":"儿","⼱":"巾"
}
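As a possible complement to the hand-made table, Unicode compatibility normalization (NFKC) maps Kangxi radicals to their ordinary ideographs. The sketch below tries a table first and falls back to NFKC; the function name to_ideograph is my own. Two caveats: NFKC returns the traditional form where the table above uses simplified characters (U+2FA7 becomes 長, not 长), and most characters in the CJK Radicals Supplement block have no such decomposition, so the table remains necessary.

```python
import unicodedata


def to_ideograph(ch, table=None):
    """Return the ordinary ideograph for a radical lookalike.

    Prefers the hand-made table; falls back to NFKC normalization,
    which leaves the character unchanged if it has no decomposition.
    """
    if table and ch in table:
        return table[ch]
    return unicodedata.normalize("NFKC", ch)


# Kangxi radical U+2FD0 normalizes to the ideograph U+9F3B.
print(to_ideograph("\u2fd0"))

# A table entry takes priority over NFKC (simplified form kept).
print(to_ideograph("\u2fa7", {"\u2fa7": "\u957f"}))
```

Putting the table first keeps the author's simplified-character choices, and the fallback only matters for radicals the table happens to miss.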

Putting the code together

# -*- coding:utf-8 -*-
import base64
from io import BytesIO

import requests
from fontTools.ttLib import TTFont
from parsel import Selector

from hanzi_mappings import hz_maps


def parse_ttf(b64_str):
    """Build the mapping from Chinese characters to digits."""
    # Pad to a multiple of 4 so b64decode does not raise "Incorrect padding"
    b64_str += "=" * (-len(b64_str) % 4)
    content = base64.b64decode(b64_str)
    # with open("page.ttf", "wb") as f:
    #     f.write(content)
    font = TTFont(BytesIO(content))
    name_list = font.getGlyphOrder()
    # Skip the first glyph; the next ten names look like "uni601D"
    unicode_list = ["\\u" + c[-4:] for c in name_list[1:11]]
    cha2num = {}
    for index, u in enumerate(unicode_list):
        c = u.encode("latin-1").decode("unicode_escape")
        cha2num[c] = str(index)
    return cha2num


def parse_html(html_text):
    """Yield the real number for each obfuscated string on the page."""
    dom = Selector(text=html_text)
    b64_str = dom.css("style::text").re_first(r"base64,(.+?)\)")
    mappings = parse_ttf(b64_str)
    fake_strings = dom.css(".col-md-1::text").re(r"\s+(.+)\s+")
    for string in fake_strings:
        real_num = ""
        for c in string:
            try:
                n = mappings[c]
            except KeyError:
                # Lookalike radical: convert it to the ordinary character first
                n = mappings[hz_maps[c]]
            real_num += n
        yield real_num


def glidedsky_login():
    """Log in to the site, which is required to view the problem.

    Note that the problem's domain must also be www.glidedsky.com.
    """
    EMAIL = ""
    PASSWORD = ""
    LOGIN_URL = "http://www.glidedsky.com/login"
    session = requests.session()
    resp = session.get(LOGIN_URL)
    dom = Selector(text=resp.text)
    _token = dom.css("meta[name='csrf-token']::attr(content)").get()
    form_data = {
        "_token": _token,
        "email": EMAIL,
        "password": PASSWORD,
    }
    session.post(LOGIN_URL, data=form_data)
    return session


def main():
    session = glidedsky_login()
    for page in range(1, 11):
        url = "http://www.glidedsky.com/level/web/crawler-font-puzzle-2?page=" + str(page)
        resp = session.get(url)
        print(f"page {page}: ", list(parse_html(resp.content.decode())))


if __name__ == "__main__":
    main()
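  • Base64 cut directly out of the page source may have a length that is not a multiple of 4, in which case decoding fails with "Incorrect padding". Pad it with = until the length is a multiple of 4.

  • hz_maps is the character mapping compiled above, imported from the hanzi_mappings module. Note that the table as listed maps ⾒ to 儿, which looks like a slip for 见.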
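The padding fix from the first note can be sketched as a small helper. The name pad_b64 is my own, and the sample string is just the word "hello" with its trailing = removed, not data from the site.

```python
import base64


def pad_b64(b64_str: str) -> str:
    """Append '=' until the length is a multiple of 4."""
    return b64_str + "=" * (-len(b64_str) % 4)


truncated = "aGVsbG8"  # base64 of b"hello" with the final "=" missing
print(base64.b64decode(pad_b64(truncated)))
```

The expression -len(b64_str) % 4 gives the number of characters missing to reach the next multiple of 4, and is 0 when the string is already aligned.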

Output

image.png

