[Southeast Asian Low-Resource Languages Project] Crawling Bilingual Titles and Abstracts of Thai Academic Literature

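0. Project goal
1. Site analysis
- 1.1 Finding the URI pattern of the documents
- 1.2 Finding the HTML pattern
2. Crawling
- 2.1 Fetching and crawling with multiple threads
- 2.2 HTML processing
- 2.3 TSV conversion
3. Complete code
4. Results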

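0. Project goal

This project is a dataset-crawling task for transfer learning on low-resource Southeast Asian languages. The chosen target is the bilingual titles and bilingual abstracts of academic papers in those languages, to be used for alignment learning.
The first step is a Thai dataset. The target repository is: http://cuir.car.chula.ac.th/simple-search?query=
For every paper record we want to collect five fields (Thai title, English title, Thai abstract, English abstract, degree type) and produce a TSV file whose columns are: serial number, Thai title, English title, Thai abstract, English abstract, degree type.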

1. Site analysis

1.1 Finding the URI pattern of the documents

The search page lists the whole collection. Opening an arbitrary record, we find that the URL in the address bar is quite complicated.

# Note: 【】 marks the part of the link that differs; the query string after ? is abbreviated as …
# First document
http://cuir.car.chula.ac.th/handle/123456789/72181?…
# Second document
http://cuir.car.chula.ac.th/handle/123456789/71235?…
# Third document
http://cuir.car.chula.ac.th/handle/123456789/71233?…
# Last document
http://cuir.car.chula.ac.th/handle/123456789/27447?…

After comparing many records, we find that the URLs differ in only two places:
http://cuir.car.chula.ac.th/handle/123456789/【72181】?…
Everything else is identical. Is this the pattern, then? Let's not jump to conclusions.
# The page has a "return to list" link, so I suspect it is a display page rather than the physical path where the paper is actually stored.

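Scrolling down, we find an attribute called URI.
Clicking it leads, surprisingly, to the same page, but the address bar has changed.
The pattern in this address is much easier to spot than the previous one: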

Document 1: http://cuir.car.chula.ac.th/handle/123456789/72181
Document 2: http://cuir.car.chula.ac.th/handle/123456789/71235
Document 3: http://cuir.car.chula.ac.th/handle/123456789/71233
Document 4: http://cuir.car.chula.ac.th/handle/123456789/73078
Document 5: http://cuir.car.chula.ac.th/handle/123456789/69502
Document 6: http://cuir.car.chula.ac.th/handle/123456789/71254
Document 20: http://cuir.car.chula.ac.th/handle/123456789/77274
Document 60: http://cuir.car.chula.ac.th/handle/123456789/75905

Last document: http://cuir.car.chula.ac.th/handle/123456789/27447

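Check:
http://cuir.car.chula.ac.th/handle/123456789/55
http://cuir.car.chula.ac.th/handle/123456789/77960
# Documents exist from 55 all the way up to 77960

Narrowing the range by bisection, we find that documents exist only for IDs 55 through 77960 (as of 2021-12-10).
The listing order does not follow the IDs, but this pattern is already good enough: a single loop over the ID range does the job.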
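The bisection step can be sketched as follows. This is a minimal illustration, not the author's script: `find_last_valid` and `fake_exists` are hypothetical names, and the search assumes that once an ID is missing every larger ID is missing too. A real predicate would request the handle URL and test for status code 200.

```python
def find_last_valid(exists, lo, hi):
    """Return the largest ID in [lo, hi] for which exists(ID) is true.

    Assumes exists(lo) is true and that validity is monotonic:
    once an ID is missing, every larger ID is missing too.
    """
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if exists(mid):
            lo = mid      # mid is valid, so the answer is mid or larger
        else:
            hi = mid - 1  # mid is missing, so the answer is smaller
    return lo


# Stand-in predicate for the demo (assumption: 77960 is the last valid ID).
def fake_exists(doc_id):
    return doc_id <= 77960


last_id = find_last_valid(fake_exists, 55, 200000)
print(last_id)
```

Each probe halves the remaining interval, so even a generous upper guess needs fewer than twenty requests.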

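1.2 Finding the HTML pattern

Open the browser developer tools (F12) on a record page and inspect the elements.
The site turns out to be easy to scrape: every value sits in a td cell, so BeautifulSoup's find method, soup.find(class_="metadataFieldValue dc_title").string, locates the string directly. Learning BeautifulSoup itself is left to the reader; the code below is commented.
The theory holds, so let's write the code.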
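To make the lookup concrete, here is a small sketch run against a hand-written HTML fragment rather than the live site. The fragment and the helper `field_text` are assumptions for illustration; only the class names come from the page inspection above. It uses `get_text()` and a `None` check, since `find()` returns `None` when a field is absent.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a record page; the real page has many more rows.
SAMPLE_HTML = """
<table>
  <tr><td class="metadataFieldValue dc_title">ตัวอย่างชื่อเรื่อง</td></tr>
  <tr><td class="metadataFieldValue dc_title_alternative">An example title</td></tr>
</table>
"""


def field_text(soup, field_class):
    """Return the text of the td for one metadata field, or '' if it is absent."""
    cell = soup.find('td', class_='metadataFieldValue ' + field_class)
    return cell.get_text(strip=True) if cell is not None else ''


# html.parser ships with Python, so the sketch needs no extra parser package.
soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
thai = field_text(soup, 'dc_title')
english = field_text(soup, 'dc_title_alternative')
missing = field_text(soup, 'dc_description_abstract')
print(thai, english, repr(missing))
```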

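2. Crawling

2.1 Fetching and crawling with multiple threads

There are more than 70,000 URIs, and visiting them one at a time is too slow, so we use several threads. The gain comes from the work being I/O-bound: while one thread waits for the network, another can run (in CPython the global interpreter lock keeps threads from executing Python code on several cores at once). More threads are not always better, because they still queue for CPU time slices and too many of them reduce efficiency. We play it safe and use three.
The code follows; read it top to bottom along with the comments.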

# Import everything we need
import io
import sys
import threading

import requests

# Copy this line as-is: it forces UTF-8 (which covers Thai) on stdout,
# so printed text is not mangled by a platform default such as GBK
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8')

BASE_URL = 'http://cuir.car.chula.ac.th/handle/123456789/'


# Crawl function. num1 and num2 bound the IDs to visit; thread_no is the number of the current thread.
# Site analysis gave 55-77960 as the valid IDs, so the three threads share them as
# 55,26023 | 26023,51991 | 51991,77961.
# range() is half-open, so the last bound is 77960 + 1 = 77961.
def processdocument(num1, num2, thread_no):
    # Counters; they are set up once, before the loop, so they accumulate over the whole range
    fail_count = 0
    success_count = 0
    for serial_num in range(num1, num2):
        document_URL = BASE_URL + str(serial_num)
        try:
            # try/except is essential here: with this many requests,
            # an occasional network error would otherwise kill the thread
            response = requests.get(document_URL, timeout=2)
            # Status code 200 means the request succeeded
            if response.status_code == 200:
                success_count += 1
                # Append the URI to this thread's own file (mode "a", UTF-8),
                # so the run leaves one file per thread
                with open("uri" + str(thread_no) + ".txt", "a", encoding='utf-8') as file:
                    file.write(document_URL + "\n")
        except requests.RequestException:
            fail_count += 1
    print(success_count)


# A thread is written as a class deriving from threading.Thread, not as a plain function.
# The important part is run(), which calls the function doing the real work, here processdocument.
class CrawlThread(threading.Thread):
    def __init__(self, thread_no, num1, num2):
        super().__init__(name="Thread-" + str(thread_no))
        self.thread_no = thread_no
        self.num1 = num1
        self.num2 = num2

    def run(self):
        print("Starting thread: " + self.name)
        processdocument(self.num1, self.num2, self.thread_no)
        print("Exiting thread: " + self.name)


# Create the thread objects, one per ID range
thread_1 = CrawlThread(1, 55, 26023)
thread_2 = CrawlThread(2, 26023, 51991)
thread_3 = CrawlThread(3, 51991, 77961)

# Start the threads; start() calls run() automatically
thread_1.start()
thread_2.start()
thread_3.start()

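After the crawl, each output file holds one URI per line. What we actually want is not the URIs but the content behind them; this step is only a quick check of the result, so the crawl can be stopped after a short while.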
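The ID ranges for the threads can also be computed instead of typed by hand. The sketch below is one possible way, using the standard-library `ThreadPoolExecutor` in place of hand-written thread classes; `split_range` and `crawl_chunk` are hypothetical names, and the worker is a placeholder that only counts IDs instead of fetching them. The computed boundaries may differ by an ID or two from a manual split while still covering the same range.

```python
from concurrent.futures import ThreadPoolExecutor


def split_range(start, stop, parts):
    """Split the half-open range [start, stop) into `parts` contiguous chunks."""
    size, extra = divmod(stop - start, parts)
    chunks = []
    lo = start
    for i in range(parts):
        # Spread the remainder over the first `extra` chunks
        hi = lo + size + (1 if i < extra else 0)
        chunks.append((lo, hi))
        lo = hi
    return chunks


def crawl_chunk(bounds):
    """Placeholder worker: a real one would fetch every ID in the chunk."""
    lo, hi = bounds
    return hi - lo  # number of IDs this worker would visit


chunks = split_range(55, 77961, 3)
with ThreadPoolExecutor(max_workers=3) as pool:
    visited = list(pool.map(crawl_chunk, chunks))
print(chunks, sum(visited))
```

Changing the number of workers then only means changing one argument.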

2.2 HTML processing

We use the BeautifulSoup module; learning it is left to the reader. Once installed with pip, it can be imported and used directly.

# The title filter below needs random; put this import at the top of the script
import random


def processdocument(num1, num2, thread_no):
    # Counters; print them yourself to inspect the result
    fail_count = 0
    success_count = 0
    for serial_num in range(num1, num2):
        document_URL = 'http://cuir.car.chula.ac.th/handle/123456789/' + str(serial_num)
        try:
            # try/except is essential here: with this many requests,
            # an occasional error would otherwise kill the thread
            response = requests.get(document_URL, timeout=2)
            # Status code 200 means the request succeeded
            if response.status_code == 200:
                success_count += 1
                # No more logging of the URI: hand the HTML of the page straight on.
                # A txt file would work too, but the final goal is a TSV file.
                # response.text is the HTML behind the URI; print it to have a look.
                processSoup('dataset' + str(thread_no) + '.tsv', response.text)
        except Exception:
            # Also covers pages that lack a field, where soup.find() returns None
            fail_count += 1
    print(success_count)


# With the URIs known, we can visit them automatically, download the HTML,
# and pick out exactly the elements we want.
# filename is the file that receives the result; html_text is the downloaded HTML.
def processSoup(filename, html_text):
    # A soup object needs the HTML text and a parser; we use lxml (pip install lxml)
    soup = BeautifulSoup(html_text, 'lxml')

    # All fields are fetched the same way: soup.find('tag', class_='class name')
    # pins down the element, and .string gives the text between the tags.
    # Which class holds which field can be read off the page with F12.
    # Why class_ and not class? class is a Python keyword, so BeautifulSoup uses class_.
    title = soup.find('td', class_='metadataFieldValue dc_title').string
    other_title = soup.find('td', class_='metadataFieldValue dc_title_alternative').string
    abstract = soup.find('td', class_='metadataFieldValue dc_description_abstract').string
    other_abstract = soup.find('td', class_='metadataFieldValue dc_description_abstractalternative').string
    degree_discipline = soup.find('td', class_='metadataFieldValue dc_degree_discipline').a.string

    # Usually title is the Thai title and other_title the English one, but some
    # records have them the other way round, so a filter is needed.
    thai_title = title
    en_title = other_title
    thai_abstract = abstract
    en_abstract = other_abstract

    # Filter: make sure thai_title really is Thai and en_title really is English
    for i in range(10):
        # Pick a random position among the first characters of the title
        index = random.randrange(min(len(title), 21))
        ch = title[index]
        # Skip punctuation and spaces and draw again
        if ch in ',. ()?!':
            continue
        # Is the character an English letter? ASCII uppercase is 65-90, lowercase 97-122.
        if 'A' <= ch <= 'Z' or 'a' <= ch <= 'z':
            # title is the English one in this record, so swap the pairs
            thai_title = other_title
            en_title = title
            thai_abstract = other_abstract
            en_abstract = abstract
        break

    # Write the collected fields to the file
    with open(filename, 'a', encoding='utf-8') as file:
        file.write(thai_title + ",")
        file.write(en_title + ",")
        file.write(thai_abstract + ",")
        file.write(en_abstract + ",")
        file.write(degree_discipline + "\n")
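The random-sampling filter can misjudge a Thai title that happens to contain an English term. As an alternative sketch (a swapped-in technique, not the author's method), the two titles can be compared by the share of their letters that fall in the Thai Unicode block U+0E00 to U+0E7F. `thai_ratio` and `order_pair` are hypothetical helper names.

```python
def thai_ratio(text):
    """Fraction of the alphabetic characters in `text` that lie in the Thai block.

    Combining vowel and tone marks are not alphabetic for str.isalpha(),
    so they are simply not counted.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    thai = sum(1 for ch in letters if '\u0e00' <= ch <= '\u0e7f')
    return thai / len(letters)


def order_pair(first, second):
    """Return (thai_text, english_text), whichever way round the inputs came."""
    if thai_ratio(first) >= thai_ratio(second):
        return first, second
    return second, first


pair = order_pair('A study of rice farming', 'การศึกษาการทำนา')
print(pair)
```

Because the decision uses every letter of both strings, it is deterministic and does not depend on which position happens to be sampled.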

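2.3 TSV conversion

Our target is a TSV file, that is, a file whose separator is the tab character. We rewrite the output step with Python's built-in csv module.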

    # Requires `import csv` at the top of the script.
    # This block replaces the file.write(...) calls at the end of processSoup.
    # newline='' is needed, otherwise stray blank lines appear in the output
    with open(filename, 'a', encoding='utf-8', newline='') as file:
        tsv_w = csv.writer(file, delimiter='\t')
        # Optional header row:
        # tsv_w.writerow(['thai_title', 'en_title', 'thai_abstract', 'en_abstract', 'degree_discipline'])
        tsv_w.writerow([thai_title, en_title, thai_abstract, en_abstract, degree_discipline])
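One reason to let the csv module do the writing is that abstracts can themselves contain tabs, commas, or line breaks. The sketch below, with made-up sample values, writes one such row to an in-memory buffer and reads it back, showing that the writer quotes awkward fields so the row survives the round trip.

```python
import csv
import io

# Made-up sample row: one field has a tab, another a comma and a line break.
row = ['ชื่อเรื่อง', 'A title', 'บทคัดย่อ\tมีแท็บ', 'An abstract,\nwith a line break', 'Master']

# newline='' keeps the buffer from translating the line endings csv writes
buffer = io.StringIO(newline='')
csv.writer(buffer, delimiter='\t').writerow(row)

# Read the same text back with the same delimiter
buffer.seek(0)
rows_back = list(csv.reader(buffer, delimiter='\t'))
print(rows_back)
```

Joining the fields by hand with a separator would break on exactly these rows.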

3. Complete code

# Import everything we need
import csv
import io
import random
import sys
import threading

import requests
from bs4 import BeautifulSoup

# Copy this line as-is: it forces UTF-8 (which covers Thai) on stdout,
# so printed text is not mangled by a platform default such as GBK
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8')

BASE_URL = 'http://cuir.car.chula.ac.th/handle/123456789/'


# Process one page. filename is the file that receives the result; html_text is the downloaded HTML.
def processSoup(filename, html_text):
    # A soup object needs the HTML text and a parser; we use lxml (pip install lxml)
    soup = BeautifulSoup(html_text, 'lxml')

    # All fields are fetched the same way: soup.find('tag', class_='class name')
    # pins down the element, and .string gives the text between the tags.
    # Which class holds which field can be read off the page with F12.
    # Why class_ and not class? class is a Python keyword, so BeautifulSoup uses class_.
    title = soup.find('td', class_='metadataFieldValue dc_title').string
    other_title = soup.find('td', class_='metadataFieldValue dc_title_alternative').string
    abstract = soup.find('td', class_='metadataFieldValue dc_description_abstract').string
    other_abstract = soup.find('td', class_='metadataFieldValue dc_description_abstractalternative').string
    degree_discipline = soup.find('td', class_='metadataFieldValue dc_degree_discipline').a.string

    # Usually title is the Thai title and other_title the English one, but some
    # records have them the other way round, so a filter is needed.
    thai_title = title
    en_title = other_title
    thai_abstract = abstract
    en_abstract = other_abstract

    # Filter: make sure thai_title really is Thai and en_title really is English
    for i in range(10):
        # Pick a random position among the first characters of the title
        index = random.randrange(min(len(title), 21))
        ch = title[index]
        # Skip punctuation and spaces and draw again
        if ch in ',. ()?!':
            continue
        # Is the character an English letter? ASCII uppercase is 65-90, lowercase 97-122.
        if 'A' <= ch <= 'Z' or 'a' <= ch <= 'z':
            # title is the English one in this record, so swap the pairs
            thai_title = other_title
            en_title = title
            thai_abstract = other_abstract
            en_abstract = abstract
        break

    # Write the collected fields to the file as one tab-separated row.
    # newline='' is needed, otherwise stray blank lines appear in the output
    with open(filename, 'a', encoding='utf-8', newline='') as file:
        tsv_w = csv.writer(file, delimiter='\t')
        # Optional header row:
        # tsv_w.writerow(['thai_title', 'en_title', 'thai_abstract', 'en_abstract', 'degree_discipline'])
        tsv_w.writerow([thai_title, en_title, thai_abstract, en_abstract, degree_discipline])


# Crawl function. num1 and num2 bound the IDs to visit; thread_no is the number of the current thread.
# Site analysis gave 55-77960 as the valid IDs, so the three threads share them as
# 55,26023 | 26023,51991 | 51991,77961.
# range() is half-open, so the last bound is 77960 + 1 = 77961.
def processdocument(num1, num2, thread_no):
    # Counters; print them yourself to inspect the result
    fail_count = 0
    success_count = 0
    for serial_num in range(num1, num2):
        document_URL = BASE_URL + str(serial_num)
        try:
            # try/except is essential here: with this many requests,
            # an occasional error would otherwise kill the thread
            response = requests.get(document_URL, timeout=2)
            # Status code 200 means the request succeeded
            if response.status_code == 200:
                success_count += 1
                # One output file per thread; thread_no is an int, hence str()
                processSoup('dataset' + str(thread_no) + '.tsv', response.text)
        except Exception:
            # Also covers pages that lack a field, where soup.find() returns None
            fail_count += 1
    print(success_count)


# A thread is written as a class deriving from threading.Thread, not as a plain function.
# The important part is run(), which calls the function doing the real work, here processdocument.
class CrawlThread(threading.Thread):
    def __init__(self, thread_no, num1, num2):
        super().__init__(name="Thread-" + str(thread_no))
        self.thread_no = thread_no
        self.num1 = num1
        self.num2 = num2

    def run(self):
        print("Starting thread: " + self.name)
        processdocument(self.num1, self.num2, self.thread_no)
        print("Exiting thread: " + self.name)


# Create the thread objects, one per ID range
thread_1 = CrawlThread(1, 55, 26023)
thread_2 = CrawlThread(2, 26023, 51991)
thread_3 = CrawlThread(3, 51991, 77961)

# Start the threads; start() calls run() automatically
thread_1.start()
thread_2.start()
thread_3.start()

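4. Results

The run produces three files of this kind, one per thread. Concatenating them end to end gives an aligned dataset ready for transfer learning.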
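The concatenation can be done with a few lines of standard-library code. This is a sketch under assumptions: `concat_files` is a hypothetical helper, and the file names are illustrative stand-ins for the per-thread outputs. The demo runs on throwaway files in a temporary directory so that it does not touch real data.

```python
import shutil
import tempfile
from pathlib import Path


def concat_files(parts, target):
    """Write the raw bytes of every existing file in `parts` to `target`, in order."""
    with open(target, 'wb') as out:
        for part in parts:
            if Path(part).exists():
                with open(part, 'rb') as src:
                    shutil.copyfileobj(src, out)


# Demo on throwaway files (names are illustrative).
workdir = Path(tempfile.mkdtemp())
parts = [workdir / ('dataset%d.tsv' % n) for n in (1, 2, 3)]
for n, part in enumerate(parts, start=1):
    part.write_bytes(('row from thread %d\r\n' % n).encode('utf-8'))

merged = workdir / 'dataset_all.tsv'
concat_files(parts, merged)
merged_text = merged.read_bytes().decode('utf-8')
print(merged_text)
```

Copying bytes rather than decoded text leaves the line endings and quoting written by the csv module untouched.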

Original address: http://outofmemory.cn/zaji/5657920.html
