Notes: from pandas data processing to knowledge graph construction (Python)

I. Data processing

1. Read all the name .xls files in a folder and combine them:

import os
import re

import pandas as pd

dfs = []
for root, sub, files in os.walk(r'C:\Users\Administrator\excel'):
    for file in files:
        if file.endswith(".xls"):
            file_name = os.path.join(root, file)
            # Read the first two sheets of each workbook (.xls needs the xlrd package)
            for i in range(2):
                df = pd.read_excel(file_name, header=None, sheet_name=i)
                # Drop the first row of each sheet (row label 0)
                df = df.drop(labels=0)
                dfs.append(df)
# Stack all sheets into one table with a fresh 0..n-1 index
df_concat = pd.concat(dfs, ignore_index=True)

2. Remove the spaces inside two-character names in the second column

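# The second column holds the names; df1 is a Series
df1 = df_concat.iloc[:, 1]
stas = re.compile(' ')
# Remove every space from each name, so that "张 三" becomes "张三"
df1 = df1.astype(str).apply(lambda x: stas.sub('', x))
df1 = df1.drop_duplicates().reset_index(drop=True)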
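
As a quick check of this step, here is a minimal sketch on made-up names. The sample values and the helper name `strip_spaces` are illustrative and not part of the original data. It also strips the full-width space (U+3000), on the assumption that some cells may have been typed with it:

```python
import re

import pandas as pd

# Matches an ASCII space or a full-width (ideographic) space, U+3000
SPACE_RE = re.compile(r'[ \u3000]')


def strip_spaces(names: pd.Series) -> pd.Series:
    """Remove spaces inside names, then drop duplicates and renumber."""
    cleaned = names.astype(str).apply(lambda x: SPACE_RE.sub('', x))
    return cleaned.drop_duplicates().reset_index(drop=True)


# Made-up sample: "张 三" and "张三" collapse into a single entry
sample = pd.Series(['张 三', '张三', '李\u3000四', '王小明'])
result = strip_spaces(sample)
print(result.tolist())
```

Removing duplicates only after the spaces are gone matters here: otherwise the spaced and unspaced spellings of one name would both survive.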

3. Remove from this list the names that also appear in another table

dete = pd.read_excel("剔除.xlsx", header=None)
# df1 is a Series, so test its values directly against the first column of the exclusion table
mask = df1.isin(dete.iloc[:, 0])
# Show the names (and their positions) that are about to be removed
print(df1[mask])
dvqc5 = df1[~mask].reset_index(drop=True)
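
The exclusion can also be expressed without nested loops. The sketch below uses made-up names, and `exclude_names` is a hypothetical helper; it only illustrates the `isin` approach and is not the author's original code:

```python
import pandas as pd


def exclude_names(names: pd.Series, excluded: pd.Series) -> pd.Series:
    """Keep only the names that do not occur in `excluded`."""
    mask = names.isin(excluded)
    return names[~mask].reset_index(drop=True)


# Made-up sample data
names = pd.Series(['张三', '李四', '王小明'])
excluded = pd.Series(['李四', '赵六'])
kept = exclude_names(names, excluded)
print(kept.tolist())
```

`isin` makes one pass over the list, whereas the nested loop compares every name against every excluded name.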

4. Look up, in the patent table, the patents whose inventors include a name from this list

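dv2334 = pd.read_excel("C:/Users/Administrator/excel/专利.xlsx")
# Columns at positions 20-31 are the ones compared against the names (the inventor columns)
inventors = dv2334.iloc[:, 20:32]
ds = []
for name in dvqc5:
    print(name)
    # Rows in which any of those columns equals this name
    ds.append(dv2334[(inventors == name).any(axis=1)])
# One row per (name, patent) match, so a patent repeats once for each listed inventor
dsa = pd.concat(ds).reset_index()
dsa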
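
For larger tables the per-name loop can be replaced by a reshape. The following is a sketch under assumptions: the column names `inv1` to `inv3` and the sample rows are invented, and stand in for the twelve inventor columns of the real patent table:

```python
import pandas as pd

# Made-up patent table; 'inv1'..'inv3' stand in for the inventor columns
patents = pd.DataFrame({
    '申请号': ['A1', 'A2', 'A3'],
    'inv1': ['张三', '赵六', '王小明'],
    'inv2': ['王小明', None, None],
    'inv3': [None, None, None],
})
names = pd.Series(['张三', '王小明'])

inventor_cols = ['inv1', 'inv2', 'inv3']
# Reshape to one row per (patent, inventor), then keep only the listed names
long = patents.melt(id_vars='申请号', value_vars=inventor_cols, value_name='inventor')
matches = long[long['inventor'].isin(names)]
# Join back to recover the full patent row for every match
result = matches[['申请号', 'inventor']].merge(patents, on='申请号')
print(result[['申请号', 'inventor']])
```

As with the loop, a patent with two listed inventors shows up twice, which is what the counting in the next step relies on.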

5. Sort the patents by how many of their inventors appear in this list

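# Each row of dsa is one (name, patent) match, so the group size is the number of listed inventors per patent
dsa_app = dsa.groupby('申请号').size().sort_values(ascending=False)
dsa_app = dsa_app.reset_index(name='count')
dsa_app.to_excel("专利统计.xlsx", index=False)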
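
To see what the count looks like, here is a small sketch on an invented match table. The `inventor` column and the application numbers are made up; only `申请号` is a column name from the notes:

```python
import pandas as pd

# Made-up match table: one row per (listed name, patent) pair
matches = pd.DataFrame({
    '申请号': ['A1', 'A3', 'A1', 'A2', 'A1'],
    'inventor': ['张三', '王小明', '王小明', '李四', '李四'],
})

# Count matches per application number, largest first
counts = (
    matches.groupby('申请号')
    .size()
    .sort_values(ascending=False)
    .reset_index(name='count')
)
print(counts)
```

Patent `A1` comes first because three of the listed names are among its inventors.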

II. Knowledge graph construction

For installing Neo4j, refer to this tutorial: link
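Split the patent classification numbers in the cleaned patent table into major-class (大类), middle-class (中类), and minor-class (小类) numbers. Save the major classes, middle classes, minor classes, and patent names as separate CSV files, four in total, and put them in the import folder under the Neo4j installation directory, ready to be loaded into Neo4j as nodes.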
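
The notes do not show how the split was done, so the following is only a sketch under stated assumptions: the classification numbers are IPC-style symbols such as `G06F17/30`, the major class is the first 3 characters, the middle class the first 4, and the minor class the full symbol. The column names `专利名` and `分类号` and the sample rows are also invented:

```python
import os
import tempfile

import pandas as pd

# Made-up rows; the column names '专利名' and '分类号' are assumptions
patents = pd.DataFrame({
    '专利名': ['patent A', 'patent B', 'patent C'],
    '分类号': ['G06F17/30', 'G06F16/90', 'H04L29/08'],
})

# Assumed split rule for IPC-style symbols (not confirmed by the notes):
# first 3 characters, first 4 characters, and the full symbol
code = patents['分类号'].str.replace(' ', '', regex=False)
patents['大类'] = code.str[:3]
patents['中类'] = code.str[:4]
patents['小类'] = code

# In practice this would be the import folder of the Neo4j installation
out_dir = tempfile.mkdtemp()
for col in ['大类', '中类', '小类', '专利名']:
    # One CSV per node type, one row per distinct value, header = column name
    nodes = patents[[col]].drop_duplicates()
    nodes.to_csv(os.path.join(out_dir, col + '.csv'), index=False, encoding='utf-8')
```

Dropping duplicates before writing keeps each node file to one row per distinct value, and the header row gives `LOAD CSV WITH HEADERS` a column name to refer to.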
Only the major-class node construction is shown here; the other node types are analogous (run the following in Neo4j):

LOAD CSV WITH HEADERS FROM "file:///大类.csv" AS line
MERGE (z:大类 {name: line.大类})

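Build relationship tables for middle class to major class, minor class to middle class, and patent name to minor class (three columns each: node, relationship, node), save them as CSV, and put them in the import folder.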
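
One possible way to produce such a three-column table with pandas is sketched below for the middle-class to major-class case. The sample rows are invented, and the relationship text `属于` ("belongs to") is an assumed placeholder, since the notes do not say what the relationship column contains:

```python
import pandas as pd

# Made-up class table; in the notes this would come from the cleaned patent table
classes = pd.DataFrame({
    '大类': ['G06', 'G06', 'H04'],
    '中类': ['G06F', 'G06F', 'H04L'],
})

# Three columns: start node, relationship, end node.
# The relationship text '属于' ("belongs to") is an assumed placeholder value.
relations = classes[['中类', '大类']].drop_duplicates().reset_index(drop=True)
relations.insert(1, '关系', '属于')

# Render as CSV text; writing to a file in the import folder works the same way
csv_text = relations.to_csv(index=False)
print(csv_text)
```

The column headers `中类`, `关系`, and `大类` are chosen so that they can be read as `line.中类`, `line.关系`, and `line.大类` when the file is loaded with headers.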
Create the relationships (run the following in Neo4j):

LOAD CSV WITH HEADERS FROM "file:///关系.csv" AS line
MATCH (from:中类 {name: line.中类}), (to:大类 {name: line.大类})
MERGE (from)-[r:所属大类 {property: line.关系}]->(to)