[python] Drawing word clouds with the wordcloud library

news/2025/2/19 9:30:06/

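A word cloud is a visual representation of text data in which the font size or color of each term reflects its importance. Word clouds are widely used on social media because they let readers pick out the most prominent terms at a glance. However, there is no standard for how a word cloud should be laid out, and the placement of words carries no logical meaning. Words whose frequencies differ widely are easy to tell apart, but words with similar colors and similar frequencies are not. For these reasons word clouds are not suitable for scientific figures. This article draws word clouds with the Python library wordcloud, which can be installed as follows: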

pip install wordcloud

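Contents

  • 0 How wordcloud draws
  • 1 Examples
    • 1.1 Word cloud from a single word
    • 1.2 Basic usage
    • 1.3 Custom word cloud shapes
    • 1.4 Drawing from a word-frequency dictionary
    • 1.5 Changing colors
    • 1.6 Setting colors for specific words
    • 1.7 Drawing Chinese word clouds
  • 2 References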

0 How wordcloud draws

All word cloud drawing functions in the wordcloud library are provided by its WordCloud class.

The WordCloud constructor has the following signature:

WordCloud(font_path=None, width=400, height=200, margin=2,
          ranks_only=None, prefer_horizontal=.9, mask=None, scale=1,
          color_func=None, max_words=200, min_font_size=4,
          stopwords=None, random_state=None, background_color='black',
          max_font_size=None, font_step=1, mode="RGB",
          relative_scaling='auto', regexp=None, collocations=True,
          colormap=None, normalize_plurals=True, contour_width=0,
          contour_color='black', repeat=False,
          include_numbers=False, min_word_length=0, collocation_threshold=30)

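The constructor parameters are described below:

Parameter          Type         Description
font_path          str          Path to the font file; required when drawing Chinese word clouds
width              int          Width of the output canvas
height             int          Height of the output canvas
margin             int          Margin around each word on the canvas
prefer_horizontal  float        Proportion of words laid out horizontally rather than vertically
mask               numpy array  If None, the word cloud is drawn on a rectangular width x height canvas; if given, it is drawn inside the mask and width and height are ignored
scale              float        Factor by which the canvas width and height are scaled up
color_func         callable     Function that assigns a color to each word
max_words          int          Maximum number of words to draw
min_font_size      int          Smallest font size
stopwords          set of str   Words to exclude from the drawing
random_state       int          Random seed; mainly affects the colors
background_color   str          Background color
max_font_size      int          Largest font size
font_step          int          Step size for the font
mode               str          Pillow image mode
relative_scaling   float        How strongly word frequency influences font size
regexp             str          Regular expression used to split the input text into tokens
collocations       bool         Whether to include collocations (bigrams) of two words
colormap           str          Matplotlib colormap from which a color is drawn at random for each word; ignored if color_func is given
normalize_plurals  bool         Whether to replace English plurals with the singular form
contour_width      int          Width of the mask contour
contour_color      str          Color of the mask contour
repeat             bool         Whether to repeat the input words until max_words is reached
include_numbers    bool         Whether to include numbers as phrases
min_word_length    int          Minimum number of letters a word must have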
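
To make relative_scaling more concrete, the sketch below models how the font size of a word could be derived from the size used for the previous, more frequent word. It follows the behaviour described in the wordcloud documentation (with 0 only the rank matters, with 1 a word that is twice as frequent is drawn twice as large). The function name next_font_size is hypothetical, and the real layout additionally shrinks the font until the word fits on the canvas, so treat this as a simplified model rather than the library's implementation.

```python
def next_font_size(font_size, freq, last_freq, relative_scaling=0.5):
    """Simplified model of how relative_scaling ties frequency to font size.

    font_size: size used for the previous (more frequent) word
    freq, last_freq: frequencies of the current and the previous word
    """
    rs = relative_scaling
    if rs == 0:
        # Only the rank matters: keep the size (layout may still shrink it)
        return font_size
    ratio = freq / float(last_freq)
    # Blend between "proportional to frequency" (rs=1) and "unchanged" (rs=0)
    return int(round((rs * ratio + (1 - rs)) * font_size))


# A word half as frequent as its predecessor, starting from size 100
print(next_font_size(100, 5, 10, relative_scaling=1))    # 50
print(next_font_size(100, 5, 10, relative_scaling=0.5))  # 75
print(next_font_size(100, 5, 10, relative_scaling=0))    # 100
```

With relative_scaling close to 1 the cloud emphasizes frequency differences; values close to 0 give a more even look in which only the ordering of the words shows.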

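The main methods provided by the WordCloud class are:

  • generate_from_frequencies(frequencies): generate a word cloud from word frequencies
  • fit_words(frequencies): same as generate_from_frequencies
  • process_text(text): split the text into words
  • generate_from_text(text): generate a word cloud from text
  • generate(text): same as generate_from_text
  • to_image: return the result as a pillow image
  • recolor: recolor the existing result
  • to_array: return the result as a numpy array
  • to_file(filename): save the result to a file
  • to_svg: return the result as an SVG string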

1 Examples

1.1 Word cloud from a single word

import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud

text = "hello"

# np.ogrid returns two arrays, with shapes n*1 and 1*m
x, y = np.ogrid[:300, :300]

# Define the drawing area: everything outside the circle is masked out
mask = (x - 150) ** 2 + (y - 150) ** 2 > 130 ** 2
mask = 255 * mask.astype(int)

# Draw the word cloud. repeat repeats the input text until max_words is reached;
# scale sets the magnification factor
wc = WordCloud(background_color="white", repeat=True, max_words=32, mask=mask, scale=1.5)
wc.generate(text)

plt.axis("off")
plt.imshow(wc, interpolation="bilinear")
plt.show()

# Write to a file
_ = wc.to_file("result.jpg")
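
The mask used above can be inspected on its own. This is a small check rather than part of the original example; it relies on the documented convention that white (255) mask entries are masked out and all other entries are free to draw on.

```python
import numpy as np

# Open grids: x has shape (300, 1), y has shape (1, 300); they broadcast to (300, 300)
x, y = np.ogrid[:300, :300]

# True outside a circle of radius 130 centred at (150, 150)
outside = (x - 150) ** 2 + (y - 150) ** 2 > 130 ** 2

# 255 (white) marks pixels that are masked out, 0 marks pixels words may occupy
mask = 255 * outside.astype(int)

print(x.shape, y.shape, mask.shape)  # (300, 1) (1, 300) (300, 300)
print(mask[150, 150], mask[0, 0])    # 0 255
```

The centre of the canvas is drawable and the corners are not, which is why the words end up arranged in a disc.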


1.2 Basic usage

from wordcloud import WordCloud

# Path of the text file
text_path = 'test.txt'
# Sample text
scr_text = '''The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!'''

# Save the sample text
with open(text_path, 'w', encoding='utf-8') as f:
    f.write(scr_text)

# Read the text back
with open(text_path, 'r', encoding='utf-8') as f:
    # text is a single string
    text = f.read()

# Generate the word cloud; WordCloud tokenizes the input text itself
wordcloud = WordCloud().generate(text)

import matplotlib.pyplot as plt
plt.axis("off")
plt.imshow(wordcloud, interpolation='bilinear')
plt.show()


# Change the maximum font size
wordcloud = WordCloud(max_font_size=50).generate(text)

# Another way to display the result
image = wordcloud.to_image()
image.show()


1.3 Custom word cloud shapes

from PIL import Image
import numpy as np
import matplotlib.pyplot as plt

from wordcloud import WordCloud, STOPWORDS

# Path of the text file
text_path = 'test.txt'
# Sample text
scr_text = '''The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!'''

# Save the sample text
with open(text_path, 'w', encoding='utf-8') as f:
    f.write(scr_text)

# Read the text back
with open(text_path, 'r', encoding='utf-8') as f:
    # text is a single string
    text = f.read()

# To draw a word cloud with a specific shape, first prepare a mask image of that shape.
# Everything in the mask image outside the target shape must be white.
mask = np.array(Image.open("mask.png"))

# Words to skip
stopwords = set(STOPWORDS)
# Also drop "better"
stopwords.add("better")

# contour_width sets the width of the mask contour and contour_color sets its color.
# If the contour is drawn inaccurately, set contour_width=0 to disable it.
wc = WordCloud(background_color="white", max_words=2000, mask=mask,
               stopwords=stopwords, contour_width=2, contour_color='red',
               scale=2, repeat=True)

# Generate the word cloud
wc.generate(text)

# Save to a file
wc.to_file("result.png")

# Show the word cloud
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.figure()
# Show the mask image
plt.imshow(mask, cmap=plt.cm.gray, interpolation='bilinear')
plt.axis("off")
plt.show()
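
Mask images found in the wild often have an off-white or noisy background, in which case only part of the background is treated as masked out. One possible fix, sketched here under the assumption that the mask has been loaded as a grayscale array (for example with Image.open("mask.png").convert("L")), is to threshold it first. The helper name to_wordcloud_mask and the threshold of 128 are illustrative choices, not part of wordcloud.

```python
import numpy as np


def to_wordcloud_mask(gray, threshold=128):
    """Return a 0/255 mask: pixels at or above threshold become white (masked out)."""
    gray = np.asarray(gray)
    return np.where(gray >= threshold, 255, 0).astype(np.uint8)


# Tiny 2x3 stand-in for a grayscale image: light background, dark shape
gray = np.array([[250, 40, 245],
                 [30, 20, 200]])
mask = to_wordcloud_mask(gray)
print(mask.tolist())  # [[255, 0, 255], [0, 0, 255]]
```

The resulting array can be passed as mask= in place of the raw image array.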


1.4 Drawing from a word-frequency dictionary

# Install with: pip install multidict
import multidict as multidict
import numpy as np
import re
from PIL import Image
from wordcloud import WordCloud
import matplotlib.pyplot as plt


# Count word frequencies
def getFrequencyDictForText(sentence):
    fullTermsDict = multidict.MultiDict()
    tmpDict = {}

    # Split on spaces
    for text in sentence.split(" "):
        # Count case-insensitively
        key = text.lower()
        # Skip common function words to get a more customized result.
        # fullmatch is used so that only these exact words are skipped,
        # not every word that merely starts with one of them.
        if re.fullmatch("a|the|an|to|in|for|of|or|by|with|is|on|that|be", key):
            continue
        tmpDict[key] = tmpDict.get(key, 0) + 1
    # Build the frequency dictionary
    for key in tmpDict:
        fullTermsDict.add(key, tmpDict[key])
    return fullTermsDict


def makeImage(text):
    mask = np.array(Image.open("mask.png"))
    wc = WordCloud(background_color="white", max_words=1000, mask=mask, repeat=True)
    wc.generate_from_frequencies(text)
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.show()


# Path of the text file
text_path = 'test.txt'
# Sample text
scr_text = '''The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!'''

# Save the sample text
with open(text_path, 'w', encoding='utf-8') as f:
    f.write(scr_text)

# Read the text back
with open(text_path, 'r', encoding='utf-8') as f:
    # text is a single string
    text = f.read()

# Build the frequency dictionary
fullTermsDict = getFrequencyDictForText(text)
# Draw
makeImage(fullTermsDict)
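
Since generate_from_frequencies accepts an ordinary dict mapping words to counts, the frequency table can also be built with the standard library alone. The following is a minimal sketch of that alternative, not the method used above; count_words, the STOP set and the token pattern are illustrative assumptions.

```python
import re
from collections import Counter

# Illustrative stop-word list, not the one shipped with wordcloud
STOP = {"a", "an", "the", "to", "in", "for", "of", "or", "by",
        "with", "is", "on", "that", "be"}


def count_words(text, stopwords=STOP):
    """Count lower-cased words, skipping stop words."""
    # A word is a run of letters, optionally with inner apostrophes (e.g. "aren't")
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    return dict(Counter(t for t in tokens if t not in stopwords))


freqs = count_words("Flat is better than nested. Sparse is better than dense.")
print(freqs["better"], freqs["than"], "is" in freqs)  # 2 2 False
```

Compared with splitting on single spaces, the regular expression also strips punctuation and line breaks, so "ugly." and "ugly" are counted as the same word.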


1.5 Changing colors

from PIL import Image
import numpy as np
import matplotlib.pyplot as plt

from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

# Path of the text file
text_path = 'test.txt'
# Sample text
scr_text = '''The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!'''

# Save the sample text
with open(text_path, 'w', encoding='utf-8') as f:
    f.write(scr_text)

# Read the text back
with open(text_path, 'r', encoding='utf-8') as f:
    # text is a single string
    text = f.read()

# Image source: https://github.com/amueller/word_cloud/blob/master/examples/alice_color.png
alice_coloring = np.array(Image.open("alice_color.png"))
stopwords = set(STOPWORDS)
stopwords.add("better")

wc = WordCloud(background_color="white", max_words=500, mask=alice_coloring,
               stopwords=stopwords, max_font_size=50, random_state=42, repeat=True)
# Generate the word cloud
wc.generate(text)
# Display
image = wc.to_image()
image.show()

# Draw a word cloud colored like alice_coloring
# Extract colors from the image
image_colors = ImageColorGenerator(alice_coloring)
# Recolor the word cloud
wc.recolor(color_func=image_colors)
# Display
image = wc.to_image()
image.show()

# Show the mask image
plt.imshow(alice_coloring, cmap=plt.cm.gray, interpolation='bilinear')
plt.axis("off")
plt.show()
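
ImageColorGenerator colors each word from the reference image. Conceptually, a word takes roughly the average color of the image region it covers. The helper below is a simplified stand-in for that idea, not the library's code; mean_patch_color and its arguments are hypothetical.

```python
import numpy as np


def mean_patch_color(image, row, col, height, width):
    """Average RGB color of a rectangular patch, as an 'rgb(r, g, b)' string."""
    patch = image[row:row + height, col:col + width, :3]  # drop alpha if present
    r, g, b = patch.reshape(-1, 3).mean(axis=0)
    return "rgb(%d, %d, %d)" % (r, g, b)


# 4x4 test image: left half red, right half blue
img = np.zeros((4, 4, 3), dtype=np.uint8)
img[:, :2] = (255, 0, 0)
img[:, 2:] = (0, 0, 255)

print(mean_patch_color(img, 0, 0, 4, 2))  # rgb(255, 0, 0)
print(mean_patch_color(img, 0, 1, 2, 2))  # rgb(127, 0, 127)
```

A word lying entirely over the red half comes out red, while a word straddling the boundary gets a blend of both colors.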


1.6 Setting colors for specific words

from wordcloud import (WordCloud, get_single_color_func)
import matplotlib.pyplot as plt


# Color function that assigns exact colors
class SimpleGroupedColorFunc(object):
    def __init__(self, color_to_words, default_color):
        # Colors for specific words
        self.word_to_color = {word: color
                              for (color, words) in color_to_words.items()
                              for word in words}
        # Color for all other words
        self.default_color = default_color

    def __call__(self, word, **kwargs):
        return self.word_to_color.get(word, self.default_color)


class GroupedColorFunc(object):
    def __init__(self, color_to_words, default_color):
        self.color_func_to_words = [
            (get_single_color_func(color), set(words))
            for (color, words) in color_to_words.items()]
        self.default_color_func = get_single_color_func(default_color)

    def get_color_func(self, word):
        """Returns a single_color_func associated with the word"""
        try:
            color_func = next(
                color_func for (color_func, words) in self.color_func_to_words
                if word in words)
        except StopIteration:
            color_func = self.default_color_func
        return color_func

    def __call__(self, word, **kwargs):
        return self.get_color_func(word)(word, **kwargs)


text = """The Zen of Python, by Tim Peters
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!"""

# collocations=False: when generating directly from text, do not count two-word collocations
wc = WordCloud(collocations=False).generate(text.lower())

# Colors for specific words
color_to_words = {
    'green': ['beautiful', 'explicit', 'simple', 'sparse',
              'readability', 'rules', 'practicality',
              'explicitly', 'one', 'now', 'easy', 'obvious', 'better'],
    '#FF00FF': ['ugly', 'implicit', 'complex', 'complicated', 'nested',
                'dense', 'special', 'errors', 'silently', 'ambiguity',
                'guess', 'hard']
}

# Words not listed above are drawn in grey
default_color = 'grey'

# Exact color function: draws with the colors in color_to_words as given, so the result is less nuanced
# grouped_color_simple = SimpleGroupedColorFunc(color_to_words, default_color)

# More nuanced color function: converts the colors in color_to_words to HSV space before drawing
grouped_color = GroupedColorFunc(color_to_words, default_color)

# Apply the color function
wc.recolor(color_func=grouped_color)

# Plot
plt.figure()
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
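
The difference between the two color functions comes from the HSV step: the base color keeps its hue and saturation while the brightness varies from word to word. The sketch below imitates that idea with the standard library only. It is an approximation for illustration, assuming the brightness is drawn uniformly from 0.2 to 1; single_color_func_sketch is a hypothetical name and takes an RGB tuple instead of a color string.

```python
import colorsys
import random


def single_color_func_sketch(rgb, rng=None):
    """Return a function that yields shades of `rgb` with random brightness."""
    rng = rng or random.Random()
    # Keep hue and saturation of the base color, discard its brightness
    h, s, _ = colorsys.rgb_to_hsv(*(c / 255.0 for c in rgb))

    def color_func(word=None, **kwargs):
        # Draw a new brightness (value) for every word
        r, g, b = colorsys.hsv_to_rgb(h, s, rng.uniform(0.2, 1))
        return "rgb({:.0f}, {:.0f}, {:.0f})".format(r * 255, g * 255, b * 255)

    return color_func


green = single_color_func_sketch((0, 255, 0), rng=random.Random(42))
# Prints a string of the form "rgb(0, g, 0)" with g between 51 and 255
print(green("beautiful"))
```

Because the extra keyword arguments are accepted and ignored, a function of this shape can be called the same way as the color functions above.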


1.7 Drawing Chinese word clouds

import jieba
import matplotlib.pyplot as plt
from wordcloud import WordCloud, ImageColorGenerator
import numpy as np
from PIL import Image  # needed for Image.open below

# Read the text
# Download: https://github.com/amueller/word_cloud/blob/master/examples/wc_cn/CalltoArms.txt
with open('CalltoArms.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# A font file must be set for Chinese text
# Download: https://github.com/amueller/word_cloud/blob/master/examples/fonts/SourceHanSerif/SourceHanSerifK-Light.otf
font_path = 'SourceHanSerifK-Light.otf'

# List of words that should not appear in the word cloud
# Download: https://github.com/amueller/word_cloud/blob/master/examples/wc_cn/stopwords_cn_en.txt
stopwords_path = 'stopwords_cn_en.txt'

# Template image for the word cloud
back_coloring = np.array(Image.open("alice_color.png"))

# New words to add to the jieba dictionary
userdict_list = ['阿Q', '孔乙己', '单四嫂子']


# Word segmentation
def jieba_processing_txt(text):
    for word in userdict_list:
        jieba.add_word(word)

    mywordlist = []
    # Segment the text
    seg_list = jieba.cut(text, cut_all=False)
    liststr = "/ ".join(seg_list)

    with open(stopwords_path, encoding='utf-8') as f_stop:
        f_stop_text = f_stop.read()
        f_stop_seg_list = f_stop_text.splitlines()

    for myword in liststr.split('/'):
        if not (myword.strip() in f_stop_seg_list) and len(myword.strip()) > 1:
            mywordlist.append(myword)
    return ' '.join(mywordlist)


# Process the text
text = jieba_processing_txt(text)

# margin sets the margin around each word
wc = WordCloud(font_path=font_path, background_color="black", max_words=2000,
               mask=back_coloring, max_font_size=100, random_state=42,
               width=1000, height=860, margin=5,
               contour_width=2, contour_color='blue')

wc.generate(text)

# Take the colors from the template image
image_colors_byImg = ImageColorGenerator(back_coloring)

plt.imshow(wc.recolor(color_func=image_colors_byImg), interpolation="bilinear")
plt.axis("off")
plt.figure()
plt.imshow(back_coloring, interpolation="bilinear")
plt.axis("off")
plt.show()
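
The filtering step inside jieba_processing_txt can be isolated as below. This is a sketch with hand-written tokens so that it runs without jieba; filter_tokens is a hypothetical helper, and keeping the stop words in a set makes each lookup constant-time instead of scanning a list.

```python
def filter_tokens(tokens, stopwords, min_len=2):
    """Drop stop words and very short tokens, then join the rest with spaces."""
    kept = []
    for tok in tokens:
        tok = tok.strip()
        if len(tok) >= min_len and tok not in stopwords:
            kept.append(tok)
    return " ".join(kept)


# Tokens as a segmenter such as jieba.cut might return them (hand-written here)
tokens = ["孔乙己", "是", "站着", "喝酒", "而", "穿", "长衫", "的", "唯一", "的", "人"]
stopwords = {"唯一"}
print(filter_tokens(tokens, stopwords))  # 孔乙己 站着 喝酒 长衫
```

Joining the surviving tokens with spaces gives WordCloud a text in which word boundaries are explicit, which Chinese text does not have on its own.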


2 References

  • wordcloud
  • Meaning of the WordCloud parameters

http://www.ppmy.cn/news/2264.html
