Scraping the Three Hundred Tang Poems and Counting the Most Frequent Characters


1. First, here is the link to the page we want to scrape:

https://so.gushiwen.cn/gushi/tangshi.aspx

2. Analyzing the page

Open the browser's developer tools and locate the information we need.

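All of the poem titles and author names sit in the same kind of repeated element on the index page. Clicking any poem opens a separate page, and after looking at a few of them it becomes clear that every poem's URL is made up of https://so.gushiwen.cn + the link's href.

The text of each poem likewise sits inside a single element on its own page.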
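The URL rule above can be sketched in a few lines. This is only an illustration: the href value is made up, and `urljoin` from the standard library is used here in place of plain string concatenation because it also copes with an href that is already a full URL.

```python
from urllib.parse import urljoin

BASE_URL = "https://so.gushiwen.cn"


def build_poem_url(href: str) -> str:
    """Join the site root with the href taken from a poem link."""
    return urljoin(BASE_URL, href)


# Hypothetical href, invented for illustration only
example_url = build_poem_url("/shiwenv_example.aspx")
print(example_url)
```

If every href on the page is known to start with a slash, simple concatenation such as `BASE_URL + href` gives the same result.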

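3. Extracting the text

Use requests.get to fetch the page content.

Use regular expressions to extract the parts we need.

response holds, for every poem, the trailing part of its link together with the poem's title and its author.

Next we visit each assembled link in turn and use another regular expression to pull out the lines of the poem.

Finally, the results are written to a .txt file using ordinary file handling.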
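To make the three captured fields concrete, here is a small sketch that runs a pattern against a hard-coded snippet instead of the live page. The snippet, the hrefs, and the assumption that the author appears in parentheses right after the link are all invented for illustration; the real markup should be checked in the developer tools.

```python
import re

# Made-up sample of the index markup; the real page may differ
sample_html = '''
<span><a href="/shiwenv_aaa.aspx" target="_blank">行宫</a>(元稹)</span>
<span><a href="/shiwenv_bbb.aspx" target="_blank">登鹳雀楼</a>(王之涣)</span>
'''

# Group 1: href, group 2: title, group 3: author (without the parentheses)
LINK_PATTERN = re.compile(r'<a href="([^"]*)"[^>]*>(.*?)</a>\((.*?)\)', re.S)

entries = LINK_PATTERN.findall(sample_html)
for href, title, author in entries:
    print(href, title, author)
```

Each element of `entries` is a tuple, so `i[0]`, `i[1]` and `i[2]` in the source code correspond to the link, the title and the author.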
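The file-writing step might look roughly like the following. The helper name `append_poem` and the temporary directory are assumptions made for this sketch so that it can run anywhere without touching real data.

```python
import os
import tempfile


def append_poem(path, title, author, body):
    """Append one poem as a 'title  author' line followed by its text."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"{title}  {author}\n{body}\n")


# Write into a throwaway directory so the sketch leaves no files behind in the project
out_path = os.path.join(tempfile.mkdtemp(), "Tangshi.txt")
append_poem(out_path, "登鹳雀楼", "王之涣", "白日依山尽，黄河入海流。")

with open(out_path, encoding="utf-8") as f:
    saved = f.read()
print(saved)
```

Opening the file in append mode means that running the scraper twice will duplicate every poem, so the output file is best deleted before a fresh run.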

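4. Generating the word cloud

In the page, every pair of verse lines is separated by a <br> tag and the lines differ in length, so the regular expression cannot separate them cleanly. As a result the <br> tags, along with leftovers such as /p, were written into the .txt file together with the text.

So before building the word cloud we first have to deal with these extra symbols. Here we use the string .replace method to turn them into " " and then generate the word cloud.

This is the word cloud we generated: [figure: word cloud of the scraped poems]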
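As an alternative to chaining .replace calls, the clean-up can be sketched with a single regular expression that removes whole tags. This is a different technique from the one used in the source code below; the sample line is made up. One advantage is that it does not delete every letter "p" or slash that happens to appear in the text itself.

```python
import re


def strip_markup(raw: str) -> str:
    """Replace anything that looks like an HTML tag with a space, then collapse whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", no_tags).strip()


# Invented example of what a scraped line might look like
raw_line = "<p>白日依山尽，黄河入海流。<br />欲穷千里目，更上一层楼。</p>"
cleaned = strip_markup(raw_line)
print(cleaned)
```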
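Besides the picture, the most frequent characters can also be counted directly. A minimal sketch with `collections.Counter`, using a short hard-coded sample in place of the scraped file; the helper name `top_characters` is an assumption of this example, not part of the original script.

```python
from collections import Counter


def top_characters(text: str, n: int = 3):
    """Return the n most common Chinese characters with their counts."""
    # Keep only CJK ideographs so punctuation and spaces are not counted
    chars = [ch for ch in text if "\u4e00" <= ch <= "\u9fff"]
    return Counter(chars).most_common(n)


sample = "春眠不觉晓，处处闻啼鸟。夜来风雨声，花落知多少。"
most_common = top_characters(sample, 1)
print(most_common)
```

To run it on the real data, read the cleaned .txt file into a string and pass that string in place of `sample`.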

5. Source code

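import re
import requests

BASE_URL = "https://so.gushiwen.cn"
INDEX_URL = BASE_URL + "/gushi/tangshi.aspx"

# The HTML tags in both patterns are assumptions about the page markup;
# check them against what the developer tools show and adjust as needed.
# LINK_RE captures (href, title, author text that follows the link).
LINK_RE = re.compile(r'<a[^>]*?href="([^"]*)"[^>]*?tar[^>]*?>(.*?)</a>([^<]*)', re.S)
# BODY_RE assumes the poem text sits in a div with class "contson".
BODY_RE = re.compile(r'<div class="contson"[^>]*>(.*?)</div>', re.S)

index = requests.get(INDEX_URL, timeout=10)
print(index)  # prints the response object, which shows the HTTP status
response = LINK_RE.findall(index.text)

with open("Tangshi.txt", "a", encoding="utf-8") as f:
    for href, title, author in response:
        page = requests.get(BASE_URL + href, timeout=10)
        bodies = BODY_RE.findall(page.text)
        if not bodies:
            continue  # skip pages where the pattern finds nothing
        f.write(title + "  " + author + "\n")
        f.write(bodies[0].strip() + "\n")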
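import jieba
import wordcloud

# Read the scraped text (Tangshi.txt is the file written by the scraper)
with open("Tangshi.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Replace leftover markup fragments with spaces
for token in ("<", ">", "br", "/", "p"):
    text = text.replace(token, " ")

# Keep a cleaned copy of the text
with open("TS.txt", "w", encoding="utf-8") as f:
    f.write(text)

# Segment the text with jieba and join the words with spaces for WordCloud
ls = jieba.lcut(text)
txt = " ".join(ls)
# msyh.ttc is a font that can render Chinese; change the path if it is not installed
w = wordcloud.WordCloud(font_path="msyh.ttc", width=1000, height=700,
                        background_color="white", max_words=100)
w.generate(txt)
w.to_file("TS.png")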

Source: 内存溢出, https://outofmemory.cn/zaji/5710922.html
