time: 2018/04/10
w3lib is a base library that Scrapy depends on for processing HTML. It is very handy for cleaning up text that contains HTML tags.
Official documentation
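1.w3lib.encoding.html_body_declared_encoding(html_body_str)
Returns the encoding declared in the body of the page. For example, if the page contains <meta charset=utf-8>, it returns utf-8.
2.w3lib.encoding.http_content_type_encoding(content_type)
Returns the encoding declared in the HTTP Content-Type header.
>>> import w3lib.encoding
>>> w3lib.encoding.http_content_type_encoding("Content-Type: text/html; charset=ISO-8859-4")
'iso8859-4'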
3.w3lib.encoding.read_bom(data)
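Returns the encoding indicated by the byte order mark (BOM) at the start of data, together with the BOM itself. If there is no BOM, it returns (None, None).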
>>> import w3lib.encoding
>>> w3lib.encoding.read_bom(b'\xfe\xff\x6c\x34')
('utf-16-be', '\xfe\xff')
>>> w3lib.encoding.read_bom(b'\xff\xfe\x34\x6c')
('utf-16-le', '\xff\xfe')
>>> w3lib.encoding.read_bom(b'\x00\x00\xfe\xff\x00\x00\x6c\x34')
('utf-32-be', '\x00\x00\xfe\xff')
>>> w3lib.encoding.read_bom(b'\xff\xfe\x00\x00\x34\x6c\x00\x00')
('utf-32-le', '\xff\xfe\x00\x00')
>>> w3lib.encoding.read_bom(b'\x01\x02\x03\x04')
(None, None)
4.w3lib.encoding.to_unicode(data_str, encoding)
Decodes data_str with the given encoding and returns a unicode string.
5.w3lib.html.remove_comments(text, encoding=None)
Removes HTML comments:
>>> import w3lib.html
>>> w3lib.html.remove_comments(b"test <!--textcoment--> whatever")
u'test  whatever'
6.w3lib.html.remove_tags(text, which_ones=(), keep=(), encoding=None)
Removes the tags and keeps their text:
>>> import w3lib.html
>>> doc = '<div><p><b>This is a link:</b> <a href="http://www.example.com">example</a></p></div>'
>>> w3lib.html.remove_tags(doc)
u'This is a link: example'
1. Keep only the tags you need:
>>> w3lib.html.remove_tags(doc, keep=('div',))
u'<div>This is a link: example</div>'
2. Remove only the selected tags:
>>> w3lib.html.remove_tags(doc, which_ones=('a','b'))
u'<div><p>This is a link: example</p></div>'
3. Note that you cannot both keep and remove tags in the same call:
>>> w3lib.html.remove_tags(doc, which_ones=('a',), keep=('p',))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/w3lib/html.py", line 101, in remove_tags
    assert not (which_ones and keep), 'which_ones and keep can not be given at the same time'
AssertionError: which_ones and keep can not be given at the same time
7.w3lib.html.remove_tags_with_content(text,which_ones=(), encoding=None)
Removes the given tags together with their content:
>>> import w3lib.html
>>> doc = '<div><p><b>This is a link:</b> <a href="http://www.example.com">example</a></p></div>'
>>> w3lib.html.remove_tags_with_content(doc, which_ones=('b',))
u'<div><p> <a href="http://www.example.com">example</a></p></div>'
8.w3lib.html.replace_entities(text, keep=(), remove_illegal=True, encoding='utf-8')
Replaces HTML entities with the corresponding unicode characters:
>>> import w3lib.html
>>> w3lib.html.replace_entities(b'Price: &pound;100')
u'Price: \xa3100'
>>> print(w3lib.html.replace_entities(b'Price: &pound;100'))
Price: £100
9.w3lib.html.replace_tags(text, token='', encoding=None)
Replaces every tag with the given token (an empty string by default):
>>> import w3lib.html
>>> w3lib.html.replace_tags(u'This text contains <a>some tag</a>')
u'This text contains some tag'
>>> w3lib.html.replace_tags('<p>Je ne parle pas <b>fran\xe7ais</b></p>', ' -- ', 'latin-1')
u' -- Je ne parle pas  -- fran\xe7ais --  -- '
10.w3lib.http.headers_raw_to_dict(headers_raw)
Converts raw headers into a dictionary:
>>> import w3lib.http
>>> w3lib.http.headers_raw_to_dict(b"Content-type: text/html\n\rAccept: gzip\n\n")
{'Content-type': ['text/html'], 'Accept': ['gzip']}