A Simple Local Job Search Engine, from Crawler to Retrieval, with a UI and Extra Features (Data Analysis Code Included)

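This feature was implemented as Part 5 of the project, so the section numbers start with 5. There are too many to renumber, so please bear with it.

Note: this resource is for study and reference only. Do not use it commercially in any form, and respect the copyright of the information.
A disclaimer up front: the sentiment classification and company ranking in this post do not reflect any real ranking, so don't take them seriously. The reviews were picked purely at random = =, although the companies are all big names, no question about that.

A friendly reminder: this post involves crawling, but I really dislike crawlers and am only a beginner, so I did not bother to understand them in depth. The code comments are very detailed, though. If crawling is what you want to learn, look for a dedicated tutorial; I only skim over it here. You can also follow along with the code comments, since the workflow is always the same.

As usual, if this helps you, please like, bookmark, and follow. This time, however, I can't post the code (too scared).

The sentiment classification used here is explained in detail in my earlier posts, and it is best read together with them. My projects tend to build on one another, which deepens understanding and saves time.
A detailed walkthrough of Bert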

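5.1 Core feature workflow

The first step is gathering information. I crawled all Python-related job postings on lp and a certain Zhipin site (for two cities, Beijing and Shanghai), and the crawled data forms the foundation of the database. I implemented two pipelines: one for lp, which meets all the acceptance requirements, and one for the Zhipin site, which supports the search engine (the app mentioned above).

1. Site 1
Crawl 400 job links across the two cities -> crawl the detail pages -> segment words -> clean the data -> remove stop words -> count word frequencies -> drop meaningless words and entries (for example, single characters) -> extract keywords with TF-IDF and generate a word cloud

2. Site 2
Crawl nearly 4K job records for the two cities (salary, location, required skills, required experience, required education, and job title) -> compute statistics for average salary by city, experience requirements, skill requirements, and education requirements -> skill word cloud

Collect reviews for the top 10 companies (found by hand) -> crawl company introductions -> aggregate each company's skills -> build the information tree -> run zero-shot evaluation on the reviews with a bert model pre-trained on a sentiment analysis task -> rank the companies by review score and build a simple relation graph -> build a simple web page that provides local search

The top 10 company reviews were collected by hand because there was nowhere suitable to crawl them from. Reviews under a dedicated job-hunting app almost always praise the company, which is not realistic, and large-scale crawling elsewhere is not feasible because there is no way to tell whether a piece of text is actually evaluating the company. So I found 10 reviews per company myself and summed the predictions to get each company's total score.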
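The score accumulation described above can be sketched as follows. This is a minimal illustration, not the author's code: the company names, the sample labels, and the assumption that each prediction is 1 (positive) or 0 (negative) are all hypothetical.

```python
from collections import defaultdict

# Hypothetical predictions: (company, predicted label), 1 = positive, 0 = negative
predictions = [
    ("CompanyA", 1), ("CompanyA", 1), ("CompanyA", 0),
    ("CompanyB", 1), ("CompanyB", 0), ("CompanyB", 0),
]


def total_scores(preds):
    """Sum the predicted labels per company."""
    scores = defaultdict(int)
    for company, label in preds:
        scores[company] += label
    return dict(scores)


def rank_companies(scores):
    """Company names ordered from highest to lowest total score."""
    return sorted(scores, key=scores.get, reverse=True)


scores = total_scores(predictions)
ranking = rank_companies(scores)
```

With 10 reviews per company, the total is simply the number of reviews the classifier labels as positive, so ties between companies are likely and the resulting order should be read loosely.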

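5.2 Crawling workflow

There is honestly not much to say here. I followed a workflow found online and adapted matching code to crawl the pages:
For lp I used request. First search for python on the site, copy the URL, and work out the region codes; once the pattern is clear, crawl 200 postings for one region and then switch to the next. Then crawl the detail pages using IPs taken from a proxy pool as the proxy (if you are not afraid of being blocked, you can crawl continuously without sleep) and save the results to job.txt for later use.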
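The author's crawler is not published, so the following is only a rough sketch of the "region code plus page number" pattern described above. Everything specific in it is an assumption: the base URL is a placeholder, the region codes, query parameter names, and page size are made up, and it uses the standard library's urllib rather than the request package so that it has no dependencies.

```python
import time
import urllib.request
from urllib.parse import urlencode

BASE_URL = "https://example.com/jobs"                  # placeholder, not the real site
REGION_CODES = {"Beijing": "010", "Shanghai": "020"}   # made-up region codes
JOBS_PER_REGION = 200                                  # from the text: 200 postings per region
JOBS_PER_PAGE = 40                                     # assumed page size


def list_page_urls(keyword="python"):
    """Build the list-page URLs for every region, page by page."""
    urls = []
    for code in REGION_CODES.values():
        for page in range(JOBS_PER_REGION // JOBS_PER_PAGE):
            query = urlencode({"key": keyword, "region": code, "page": page})
            urls.append(f"{BASE_URL}?{query}")
    return urls


def fetch(url, proxy=None, delay=0.0):
    """Download one page, optionally through an HTTP proxy taken from a pool."""
    handlers = []
    if proxy:
        handlers.append(urllib.request.ProxyHandler({"http": proxy, "https": proxy}))
    opener = urllib.request.build_opener(*handlers)
    opener.addheaders = [("User-Agent", "Mozilla/5.0")]
    time.sleep(delay)  # set delay > 0 to be gentler on the site
    with opener.open(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


urls = list_page_urls()
```

Only `list_page_urls` is called here; `fetch` is left uncalled so the sketch makes no network requests when run.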
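For Zhipin I used webdriver, which works a bit differently. I did not need the detail pages, so I crawled only the basic information and tags (skills) shown on the python results page, which can be extracted with regular expressions. When one page is done, move to the next, sleeping 5 s per page to avoid being blocked; when one city has enough records, switch to the next. Whereas request needs the URL codes, here you only need to note the corresponding tags. The crawled data goes into 数据.csv for later use.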
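As an illustration of the regular-expression extraction step, here is a small sketch run against made-up markup. The HTML structure, class names, and sample values are hypothetical; the real page looks different, and the real crawler obtains the page source through webdriver rather than from a string.

```python
import csv
import io
import re

# Hypothetical markup; the real page structure will differ
SAMPLE_PAGE = """
<li class="job"><span class="name">Python Engineer</span>
<span class="area">Beijing</span><span class="salary">20-30K</span>
<span class="tag">Python</span><span class="tag">Django</span></li>
<li class="job"><span class="name">Data Analyst</span>
<span class="area">Shanghai</span><span class="salary">15-25K</span>
<span class="tag">SQL</span></li>
"""

JOB_RE = re.compile(r'<li class="job">(.*?)</li>', re.S)


def field(block, cls):
    """Return the text of the first span with the given class, or ''."""
    m = re.search(rf'<span class="{cls}">(.*?)</span>', block)
    return m.group(1) if m else ""


def parse_jobs(html):
    """Extract [title, location, salary, skill tags] for every job card."""
    rows = []
    for block in JOB_RE.findall(html):
        tags = re.findall(r'<span class="tag">(.*?)</span>', block)
        rows.append([field(block, "name"), field(block, "area"),
                     field(block, "salary"), tags])
    return rows


rows = parse_jobs(SAMPLE_PAGE)

# Write the rows as CSV (to an in-memory buffer here instead of a file)
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
```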

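Note: the Boss crawler is one I found online and modified, with only minor tweaks. So many identical copies are marked as original that I cannot give a link; a quick search will turn it up. Here I only used its data for analysis.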

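5.3 Parsing workflow

5.3.1 Site 1 data:

This is much the same as the workflow above. The required analysis is so simple that I hardly know what to explain, so I will just walk through the code I wrote:
5.3.1.1 First read in the job.txt file from earlier. Removing special symbols from a string should be familiar to everyone: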
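One compact way to strip symbols is a translation table, sketched below. This is an alternative to the character-by-character `replace` loop in the complete code, not the author's version; the punctuation set and the sample sentence are illustrative.

```python
# Characters to delete: ASCII symbols plus common full-width punctuation
PUNCTUATION = r'|\/*-+.!@#$%^&()<>?"~`,' + "，。、“”（）《》"
DELETE_TABLE = str.maketrans("", "", PUNCTUATION)


def strip_symbols(text):
    """Remove every character listed in PUNCTUATION in a single pass."""
    return text.translate(DELETE_TABLE)


cleaned = strip_symbols("熟悉Python、Django（优先），有NLP经验!")
```

`str.translate` walks the text once regardless of how many symbols are listed, which matters little for one file but keeps the intent in one place.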

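Incidentally, I wrote this code when I had just started teaching myself python.

5.3.1.2 Next, segment the text and remove stop words. jieba's built-in lcut function handles segmentation, and stop words are filtered against a stop-word list. Meaningless words that need special handling can be added to the stop-word list, and single-character or overly long tokens can simply be deleted.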
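The filtering rules above can be sketched on an already-segmented token list. The token list stands in for the output of `jieba.lcut`, and the `max_len=10` cutoff for "overly long" tokens is an assumption; the post does not state a threshold.

```python
def filter_tokens(tokens, stopwords, min_len=2, max_len=10):
    """Drop stop words, single characters, and overly long tokens."""
    stop = set(stopwords)  # set lookup is much faster than scanning a list
    return [t for t in tokens if t not in stop and min_len <= len(t) <= max_len]


# Stand-in for jieba.lcut(...) output
tokens = ["负责", "的", "Python", "开发", "和", "维护", "岗"]
kept = filter_tokens(tokens, stopwords=["的", "和", "负责"])
```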

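5.3.1.3 Finally, extract keywords and generate the word cloud. This is also simple, but I found that plain tf-idf does not suit the crawled text. First, I forgot to separate the postings (everything went into one file...). Second, after applying it the rerank barely changed anything. To make the effect more pronounced, we raise the importance of idf. Each posting carries about the same amount of information, so we simply split the job file into 400 equal parts (400 jobs were crawled). There are about 80K tokens in total, so every 200 tokens go into one document, and then we count the document frequency used for idf.

As you can see, a word that appears several times in one document is counted only once. Since we want to raise the weight, we multiply the denominator by 2 compared with the ordinary idf: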
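A small sketch of this boosted idf, under the assumption that the ordinary form is log(N / (df + 1)) and that "multiplying the denominator by 2" gives log(N / (2 * (df + 1))). The helper names, the tiny token list, and the chunk size of 2 are illustrative only.

```python
import math


def chunk(tokens, size):
    """Split a token list into consecutive pseudo-documents of `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def document_frequency(docs):
    """Number of documents each word appears in (counted once per document)."""
    df = {}
    for doc in docs:
        for word in set(doc):
            df[word] = df.get(word, 0) + 1
    return df


def boosted_idf(df, n_docs, boost=2):
    """idf with the denominator multiplied by `boost`."""
    return {w: math.log(n_docs / (boost * (c + 1))) for w, c in df.items()}


tokens = ["python", "sql", "python", "linux", "python", "sql", "java", "python"]
docs = chunk(tokens, 2)  # four pseudo-documents of two tokens each
df = document_frequency(docs)
idf = boosted_idf(df, n_docs=len(docs))
```

With N = 400 documents the boosted form reduces to log(200 / (df + 1)). Words that occur in most documents get a negative weight under this scheme, which pushes them far down after the rerank.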

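5.3.1.4 Then rerank to get the result we want (and generate the word cloud).
Generating the word cloud can be done almost entirely with stock code. Colors are assigned randomly here (it would be better if color temperature were tied to word frequency).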
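The parenthetical idea of tying color temperature to frequency could look like the sketch below. It is not what the complete code does (that code picks hues at random); the linear hue mapping from blue to red and the fixed saturation and lightness are assumptions chosen for illustration.

```python
def hue_for_weight(weight, max_weight):
    """Map a weight to an HSL hue: 240 (blue, cold) for rare words down to 0 (red, hot)."""
    if max_weight <= 0:
        return 240
    ratio = min(max(weight / max_weight, 0.0), 1.0)
    return int(round(240 * (1 - ratio)))


def hsl_color(weight, max_weight):
    """Return an HSL color string in the same format the random color function produces."""
    return "hsl({}, 100%, 40%)".format(hue_for_weight(weight, max_weight))
```

To use it in a word cloud, a color function with the same parameters as `random_color_func` could look up the word's weight in a dictionary and return `hsl_color(weight, max_weight)`.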

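The left and right images differ: the left uses the ordinary idf and the right uses the boosted idf weight, where some actual skills start to surface.

5.3.2 Boss

This is the main part, which implements the search engine features:
As mentioned above, the crawled data has been saved to 数据.csv, as shown in the figure:

Then we compute scores for the top 10 company reviews with the pre-trained Bert sentiment classification model: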

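I won't paste that code since there is no need. Here is the accuracy plot from my pre-training instead:

Ranking from the zero-shot results:

Combined with the crawled introductions of the top 10 companies, we can build the following storage dictionary:

As shown, this structure makes lookups very convenient (although only for certain categories of questions).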
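Since the figure is not reproduced here, the shape of the storage dictionary can be sketched from the complete code further down: one entry per company with `score`, `introduction`, `job`, and `skill` keys. The company, job, and values below are invented placeholders, and `salary_of` is a hypothetical helper.

```python
# Made-up example entry; the key layout follows the complete code below
info = {
    "CompanyA": {
        "score": "8",  # total sentiment score, stored as text
        "introduction": "A made-up company used for illustration.",
        "job": {
            "Python Engineer": {
                "skill": "['Python', 'Django']",
                "education": "Bachelor",
                "experience": "3-5 years",
                "salary": "20-30K",
                "location": "北京",
            },
        },
        "skill": ["Python", "Django"],  # de-duplicated skills across all jobs
    },
}


def salary_of(info, company, title):
    """Look up one field by walking company -> job -> attribute."""
    return info[company]["job"][title]["salary"]


salary = salary_of(info, "CompanyA", "Python Engineer")
```

Because jobs are keyed by title, two postings with the same title at one company overwrite each other; that is a property of this layout worth keeping in mind.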

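The first feature is judging how good a company is. We sort companies by their score to establish the relation, so you can ask, for example: which companies are better than 阿里?

The output then shows the introductions of the companies with a higher score.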
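A sketch of the "better than X" lookup driven directly by scores. The complete code below uses a hand-ordered list instead, so this is an alternative under the assumption that numeric scores are available; the names and numbers are placeholders.

```python
def better_than(scores, company):
    """Companies whose score is strictly higher than the given company's, best first."""
    target = scores[company]
    higher = [name for name, s in scores.items() if s > target]
    return sorted(higher, key=scores.get, reverse=True)


# Placeholder scores
scores = {"A": 9, "B": 7, "C": 5, "D": 2}
above_c = better_than(scores, "C")
```

Sorting by score in code removes the need to maintain the order by hand whenever the reviews change.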

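Here we unify synonyms. I only wrote a few; the rest are handled the same way.

As shown above, synonymous words are mapped to one form to simplify matching (you could also just check whether the word is in the list).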
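The synonym unification can be sketched as a reverse lookup table built from the groups. The groups reuse a few of the word lists from the complete code; `GROUPS`, `CANONICAL`, and `normalize` are illustrative names.

```python
# Canonical form -> all accepted variants (a subset of the lists in the complete code)
GROUPS = {
    "好": ["好", "很好", "good", "比较好", "非常好"],
    "阿里": ["阿里", "阿里巴巴", "阿里集团"],
    "bilibili": ["bilibili", "b站", "B站", "bili", "小破站"],
}

# Reverse index: variant -> canonical form
CANONICAL = {variant: canon for canon, variants in GROUPS.items() for variant in variants}


def normalize(tokens):
    """Replace every known variant with its canonical form; leave other tokens alone."""
    return [CANONICAL.get(t, t) for t in tokens]


normalized = normalize(["B站", "比", "阿里巴巴", "很好"])
```

Adding a new company alias then only means extending one list, with no extra `if` branch.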

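The implementation is fairly simple: segment the query first,

then map keywords to question types.

Another feature is finding decent companies, which works the same way:

The top 5 are considered good and the rest not good; what is returned is the introductions of those companies.

Another feature is searching job requirements, which includes recognizing and searching by location, company, specific position, and specific skill. The code is as follows: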
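A simplified sketch of keyword-based job filtering, separate from the full code below. It assumes the question is already segmented, stores each job's skills as a list under a `skills` key (a simplification of the post's structure), and treats any question word that matches a known skill as a skill filter. All data is made up.

```python
def search_jobs(info, words, cities=("北京", "上海")):
    """Return (company, title) pairs matching the companies, cities, and skills in the question."""
    asked = {w.casefold() for w in words}
    # If no company is named, search all of them
    companies = [c for c in info if c.casefold() in asked] or list(info)
    wanted_cities = {c for c in cities if c in asked}
    known_skills = {s.casefold() for c in info.values()
                    for job in c["job"].values() for s in job["skills"]}
    wanted_skills = asked & known_skills
    hits = []
    for company in companies:
        for title, job in info[company]["job"].items():
            if wanted_cities and job["location"] not in wanted_cities:
                continue
            if wanted_skills and not wanted_skills & {s.casefold() for s in job["skills"]}:
                continue
            hits.append((company, title))
    return hits


# Made-up data
info = {
    "A": {"job": {"后端开发": {"location": "北京", "skills": ["Python", "Django"]},
                  "数据分析": {"location": "上海", "skills": ["SQL", "Python"]}}},
    "B": {"job": {"算法工程师": {"location": "北京", "skills": ["PyTorch"]}}},
}
beijing_python = search_jobs(info, ["北京", "python", "工作"])
company_b = search_jobs(info, ["B", "有", "什么", "职位"])
```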

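Many other features can be implemented in a similar way, all based on the words themselves, so I won't write more.

I also built a simple web page:

For convenience, the question is passed by downloading the text in the search box as a file. Running our py script then generates output, which can be shown in the page through the file picker and the display button.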
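The file-based hand-off between the page and the script can be sketched as below. The function name, the `answer_fn` callback, and the use of a temporary directory are assumptions made so the sketch runs anywhere; the post reads save.txt from a fixed download folder and writes the answer file next to the script.

```python
import tempfile
from pathlib import Path


def answer_question(question_file, output_file, answer_fn):
    """Read the saved question, compute an answer, and write it for the page to load."""
    question = Path(question_file).read_text(encoding="utf-8").strip()
    Path(output_file).write_text(answer_fn(question), encoding="utf-8")
    return question


# Demo with a temporary directory standing in for the download folder
with tempfile.TemporaryDirectory() as tmp:
    q = Path(tmp) / "save.txt"
    out = Path(tmp) / "output.txt"
    q.write_text("有什么比阿里好的公司？", encoding="utf-8")
    answer_question(q, out, lambda text: "echo: " + text)
    result = out.read_text(encoding="utf-8")
```

Passing the paths in as parameters avoids hard-coding a user-specific download directory in the script.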

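As you can see, the result is shown on the page. I don't know much html and couldn't switch between pages, so all returned results are dumped into output and shown together.

It can recommend jobs too. Basically all of the question types mentioned above can be searched and answered.

Complete code:

1. Site 1 content analysis: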

import jieba
import matplotlib.pyplot as plt
import wordcloud
import random
import math


# Return the text of a file with punctuation removed
def get_text(file_name):
    with open(file_name, 'r', encoding='utf-8') as fr:
        text = fr.read()
    # Punctuation to delete (raw string, so the backslash is taken literally)
    for ch in r'|\/*-+.!@#$%^&*()/.,，?><"~`。、“”（）《》':
        text = text.replace(ch, '')  # no need to substitute a space
    return text


def stopwordslist(filepath):  # build the stop-word list
    # Read the stop-word file line by line and return it as a list
    with open(filepath, 'r', encoding='utf-8') as fr:
        return [line.strip() for line in fr]


file_name = 'job.txt'
text = get_text(file_name)
stopwords = stopwordslist('./停用词.txt')  # Baidu and HIT Chinese/English stop-word lists
# print(stopwords)
vlist = jieba.lcut(text)  # segment with jieba; returns a list
# print(len(vlist)) # 81701 tokens


res_dict = {}
# Count term frequencies
for i in vlist:
    if i not in stopwords:
        key = i.title()  # read and write under the same normalized key
        res_dict[key] = res_dict.get(key, 0) + 1


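# Treat every 200 tokens as one document, to count how many documents each word appears in
def split_list_average_n(origin_list, n):
    for i in range(0, len(origin_list), n):
        yield origin_list[i:i + n]


idf_list = split_list_average_n(vlist, 200)  # n is the chunk size: 200 tokens per document
# for i in idf_list:
#    print(i)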


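res_list = list(res_dict.items())
# Sort in descending order of frequency
res_list.sort(key=lambda x: x[1], reverse=True)
fin_res_list = []

# Drop single-character tokens
for item in res_list:
    if len(item[0]) >= 2:
        fin_res_list.append(item)

# print(len(fin_res_list)) # 4730
# Document frequency: a word counts once per document, however often it occurs there
idf_dict = {}
for i in idf_list:
    doc_words = {w.title() for w in i}  # same normalization as the frequency keys
    for j in fin_res_list:
        if j[0] in doc_words:
            idf_dict[j[0]] = idf_dict.get(j[0], 0) + 1

print(idf_dict)
print(len(idf_dict))

# Boosted idf for the 400 documents, denominator doubled: log(400 / (2 * (df + 1)))
for i in idf_dict.keys():
    idf_dict[i] = math.log(200 / (idf_dict[i] + 1))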


counts = 0
for i in fin_res_list:
    if i[0] in idf_dict.keys():
        idf_dict[i[0]] = idf_dict[i[0]] * i[1]  # tf-idf
        # idf_dict[i[0]] = i[1] / idf_dict[i[0]]

idf_dict = sorted(idf_dict.items(), key=lambda x: x[1], reverse=True)  # descending order
print(idf_dict)

for i in range(50):
    word, count = idf_dict[i]
    pstr = str(i + 1) + ':'
    print(pstr, end=' ')
    print(word, count)


# (1) Custom color function
def random_color_func(word=None, font_size=None, position=None, orientation=None, font_path=None, random_state=None):
    # h = randint(0, 255)  # hue from 0 to 255: red -> orange -> yellow -> green -> cyan -> blue -> purple
    # 51-70 is a bright yellow that is hard to read and 0-15 is an overly striking red,
    # so pick randomly from 16-50 or 71-255 instead
    h = random.choice((random.randint(16, 50), random.randint(71, 255)))
    s = int(100.0 * 255.0 / 255.0)
    l = int(100.0 * float(random.randint(60, 120)) / 255.0)
    string = "hsl({}, {}%, {}%)".format(h, s, l)
    print(string)
    return string


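# (2) Load the background image
back_pic = plt.imread("wordcloud_backgroud.jpg")  # background image

# (3) Create the word cloud object
wc = wordcloud.WordCloud(scale=4,  # larger scale means higher resolution and a sharper image
                         font_path=r'C:\Windows\Fonts\simhei.ttf',  # font from the Windows Fonts folder; raw string keeps the backslashes literal
                         background_color="white",  # background color
                         max_words=200,  # maximum number of words shown
                         mask=back_pic,  # mask image
                         max_font_size=250,  # maximum font size
                         random_state=42,
                         color_func=random_color_func)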

word_list = []
for i in idf_dict:
    for j in range(int(i[1])):
        word_list.append(i[0])  # keep only the word, repeated according to its weight

random.shuffle(word_list)
print(" ".join(word_list))
my_wordcloud = wc.generate(" ".join(word_list))

plt.imshow(my_wordcloud)
plt.axis("off")
plt.show()
wc.to_file('./image/词云_猎聘.png')  # save the image file
plt.close()

2. Site 2 content analysis + search engine implementation:

import csv
import os
import jieba
import eel
import json
import pandas as pd

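# Raw string: in a normal string literal "\U" starts a unicode escape and is a syntax error
with open(r"C:\Users\72959\Downloads\save.txt", 'r', encoding='utf-8') as fr:
    question = fr.read()

# Segment the user's question read from the web page
question_words = jieba.lcut(question)
print(question_words)
# File names of the company introductions (top level of the folder only)
cp_names = os.listdir("./top10公司介绍/")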

# print(cp_names)

# Now build the relations and the dictionary for the ten companies
# Information dictionary
infr_dict = {}

for i in cp_names:
    with open("./top10公司介绍/" + i, 'r', encoding='utf-8') as fr:
        score = fr.readline()
        introduction = fr.read()
    # splitext removes the extension; strip('.txt') would strip characters, not the suffix
    name = os.path.splitext(i)[0]
    infr_dict[name] = {
        'score': score.strip('\n'),
        'introduction': introduction,
        'job': {},
        'skill': [],
    }

# save_file = '信息字典.csv'
# with open(save_file, 'w', encoding='utf8') as f:
#   json.dump(infr_dict, f, ensure_ascii=False, indent=4)
# print(infr_dict)

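# data columns: job title 0, location 1, salary 2, experience 3, education 4, company 5, skills 6
# data = pd.read_csv('./数据.csv')
# print(data.company)
# Store the job data in the dictionary, matching rows by company name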

with open('./数据.csv', encoding='utf-8') as f:
    f_csv = csv.reader(f)
    headers = next(f_csv)

    for i in f_csv:
        # print(i)
        if i[5] in infr_dict.keys():
            infr_dict[i[5]]['job'][i[0]] = {}
            infr_dict[i[5]]['job'][i[0]]['skill'] = i[6]
            infr_dict[i[5]]['job'][i[0]]['education'] = i[4]
            infr_dict[i[5]]['job'][i[0]]['experience'] = i[3]
            infr_dict[i[5]]['job'][i[0]]['salary'] = i[2]
            infr_dict[i[5]]['job'][i[0]]['location'] = i[1][:2]
            # Add to the company's overall skills, without duplicates

            i[6] = i[6].strip('[')
            i[6] = i[6].strip(']')
            i[6] = i[6].split(",")
            for j in i[6]:
                if j not in infr_dict[i[5]]['skill']:
                    infr_dict[i[5]]['skill'].append(j)


# print(infr_dict.keys())
"""
for i in infr_dict.keys():
    print(i)
    print(infr_dict[i]['score'])
"""
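# The dictionary is ready; now build the relations
# The first relation is how good a company is, based on the sentiment scores stored under 'score'. I ordered these ten by hand; doing it in code is easy too (a bubble sort as in c++, for example)
# This order does not reflect any real ranking, since the 10 reviews per company were picked at random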
rank_list = ['腾讯', '蚂蚁集团', '阿里', 'bilibili', '滴滴', '携程', '拼多多', '字节跳动', '美团', '饿了么']

good_list = ['好', '很好', 'good', '比较好', '非常好']
bad_list = ['差', '很差', 'bad', '比较差', '非常差', '坏']
ali = ['阿里', '阿里巴巴', '阿里集团']
bili = ['bilibili', 'b站', 'B站', 'bili', '小破站']
eleme = ['饿了么', '饿', '饿了吗']
# And so on; I did not list the rest
# Normalize the tokens in question_words first: map every variant to its canonical form
synonyms = {}
for canonical, variants in (('好', good_list), ('差', bad_list), ('阿里', ali),
                            ('bilibili', bili), ('饿了么', eleme)):
    for w in variants:
        synonyms[w] = canonical
question_words = [synonyms.get(w, w) for w in question_words]

f = open('ouput.txt', 'w', encoding='utf-8')

# Company comparison queries:
if '公司' in question_words and '比' in question_words:
    if '好' in question_words:
        for i in question_words:
            if i in rank_list:
                # print(rank_list[:rank_list.index(i)])
                for j in rank_list[:rank_list.index(i)]:
                    f.write(infr_dict[j]['introduction'])
                    f.write("\n\n\n")
    if '差' in question_words:
        for i in question_words:
            if i in rank_list:
                # print(rank_list[rank_list.index(i) + 1:])
                for j in rank_list[rank_list.index(i) + 1:]:
                    f.write(infr_dict[j]['introduction'])
                    f.write("\n\n\n")

# Good and not-so-good companies
if '公司' in question_words and '比较' in question_words:
    if '好' in question_words:
        # print(rank_list[:5])
        for i in rank_list[:5]:
            f.write(infr_dict[i]['introduction'])
            f.write("\n\n\n")
    if '差' in question_words:
        # print(rank_list[5:])
        for i in rank_list[5:]:  # everything after the top 5
            f.write(infr_dict[i]['introduction'])
            f.write("\n\n\n")

skill_list = []  # skill lists collected from the matching companies
question_set = {w.casefold() for w in question_words}
city_asked = '上海' in question_set or '北京' in question_set
skill_asked = '技能' in question_set


def job_matches(title):
    # A job matches when every token of its title appears in the question
    tokens = set(jieba.lcut(title.casefold()))
    return bool(tokens) and tokens <= question_set


def location_matches(job):
    return city_asked and job['location'] in question_set


# Job search for the companies named in the question
for company in rank_list:
    if company.casefold() not in question_set:
        continue
    for title, job in infr_dict[company]['job'].items():
        if not location_matches(job):
            continue
        if skill_asked:
            skill_list.append(infr_dict[company]['skill'])  # company-wide skills, collected once
            break
        if job_matches(title):
            f.write(str(job))  # write the full job record

# Job search across all companies
for company, info in infr_dict.items():
    if '推荐' in question_set and '工作' in question_set:
        f.write(str(list(info['job'].keys())))  # recommend: list the company's job titles once
    if company in rank_list and company.casefold() in question_set:
        continue  # already handled by the loop above
    for title, job in info['job'].items():
        if job_matches(title) and location_matches(job):
            if skill_asked:
                skill_list.append(info['skill'])
                break
            f.write(str(job))  # write the full job record

# Write the union of all collected skills once, after both searches
all_skills = {skill for skills in skill_list for skill in skills}
if all_skills:
    print(all_skills)
    f.write(str(all_skills))
f.close()
# Other features follow the same pattern and are omitted here

# print(infr_dict)

3. Search engine HTML

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">

    <title>Search Engine</title>

    <style>
        img{
            position:absolute;
            left:450px;
            top:120px;
        }


        body,ul{
            margin:0;/* reset the default margins */
        }
        ul{
            padding-left:0;
            list-style:none;
        }
        textarea{
            display:block;
            width:640px;
            height:30px;
            margin:100px auto 0;
            font-size: 25px;
        }
        .btn{
            width:500px;
            margin:10px auto;
            text-align:center;
            font-size: 25px;
        }
        .msg{
            margin:0 auto;
            width:500px;
        }
        .msglist{
            line-height:50px;
            border-bottom:1px dashed #ccc;
            text-indent: 2em;/* indent the first line by two characters */
        }
    </style>
</head>
<body>

    <body background="./你的名字带logo.png" style=" background-repeat:no-repeat ;background-size:100% 100%;
		background-attachment: fixed;">
            <br>
			<br>
			<br>
			<br>
            <br>
			<br>
			<br>
			<br>
            <br>
			<br>

    <textarea cols="30" rows="10"></textarea>

    <div class="btn">
        <button>Search</button>
    </div>
    <ul class="msg"></ul>


<script type="text/javascript">
function show()
{
    var reader = new FileReader();
    reader.onload = function()
    {
        alert(this.result)
    }
    var f = document.getElementById("filePicker").files[0];
    reader.readAsText(f);
}
</script>
<input type="file" name="file" id="filePicker"/>
<div id = "div1">
</div>
<input type="button" value = "Show"  onclick="show()"/>



    <script>
        function fake_click(obj) {
    var ev = document.createEvent("MouseEvents");
    ev.initMouseEvent(
        "click", true, false, window, 0, 0, 0, 0, 0, false, false, false, false, 0, null
    );
    obj.dispatchEvent(ev);
}

    function download(name, data) {
    var urlObject = window.URL || window.webkitURL || window;

    var downloadData = new Blob([data]);

    var save_link = document.createElementNS("http://www.w3.org/1999/xhtml", "a")
    save_link.href = urlObject.createObjectURL(downloadData);
    save_link.download = name;
    fake_click(save_link);
    }


        let btn=document.querySelector("button"),//get the button element
            textArea=document.querySelector("textarea"),
            msg=document.querySelector(".msg");
        btn.onclick=function(){
            if(textArea.value){
                //call the download helper so the py script can read the query
                download("save.txt",textArea.value);
                //msg.innerHTML+=""+textArea.value+"";//append the entry to the ul
                //textArea.value="";//clear the input box
            }else{
                alert("Nothing entered yet, please type a question")
            }
        }

    </script>
    <script src="https://cdn.bootcss.com/jquery/3.3.1/jquery.min.js"></script>
    <script type="text/javascript">
        var openFile = function(event) {
            var input = event.target;
            var reader = new FileReader();
            reader.onload = function() {
                if(reader.result) {
                    //show the file contents
                    $("#output").html(reader.result);
                }
            };
            reader.readAsText(input.files[0]);
        };
    </script>

</body>
</body>
</html>

Sharing is welcome. When reposting, please credit the source: 内存溢出

Original URL: http://outofmemory.cn/langs/797472.html
