Scraping Girl Pics with Multithreaded Python



A while back I saw someone's Python code for scraping girl-pic sites, so I figured I'd write a tutorial of my own. Most tutorials out there call third-party modules; today we'll crawl with nothing but the standard library, and extend it with image screening, image de-duplication and a few other tricks. It has been validated against the live site and is steady as an old dog. I've already pulled down tens of thousands of images; the only limit is the size of your disk.

The meizitu site has since been scraped into the ground and shut down, so the code below is for reference only.

On the front end, every image is wrapped in an img tag, like <img src="https://mtl.gzhuibei.com/images/img/10431/5.jpg" alt= , so we can match it straight out of the HTML with a regular expression.
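To see that match in isolation, here is a quick sketch; the HTML fragment below is made up for illustration, but the pattern is the same one the spider uses further down:

import re

html = '<li><img src="https://mtl.gzhuibei.com/images/img/10431/5.jpg" alt="demo"></li>'
# capture the .jpg URL out of each img tag
print(re.findall(r'<img src="([^"]+\.jpg)"', html))
# -> ['https://mtl.gzhuibei.com/images/img/10431/5.jpg']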

First, generate the page links. The code is as follows:

# Take a URL template plus a page range, splice the pages together and return them as a list
def SplicingPage(page, start, end):
    url = []
    for each in range(start, end):
        temporary = page.format(each)
        url.append(temporary)
    return url
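A quick sanity check of what it returns; the item URL below is hypothetical, just to show the shape of the output:

print(SplicingPage("https://www.meitulu.com/item/10431_{}.html", 2, 5))
# -> ['https://www.meitulu.com/item/10431_2.html',
#     'https://www.meitulu.com/item/10431_3.html',
#     'https://www.meitulu.com/item/10431_4.html']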

Next, crawl the pages with the built-in library:

# Fetch a page's HTML source with the built-in urllib
def GetPageURL(page):
    head = GetUserAgent(page)
    req = request.Request(url=page, headers=head, method="GET")
    respon = request.urlopen(req, timeout=3)
    if respon.status == 200:
        HTML = respon.read().decode("utf-8")
        return HTML

Finally, regex-match and scrape. That's it; work through the code for a moment and it explains itself, it really is that simple:

page_list = SplicingPage(str(args.url), 2, 100)
for item in page_list:
    respon = GetPageURL(str(item))
    subject = re.findall('<img src="([^"]+\.jpg)"', respon, re.S)
    for each in subject:
        img_name = each.split("/")[-1]
        img_type = each.split("/")[-1].split(".")[1]
        save_name = str(random.randint(1111111, 99999999)) + "." + img_type
        print("[+] Original name: {} saved as: {} path: {}".format(img_name, save_name, each))
        urllib.request.urlretrieve(each, save_name)  # pass save_name so the random name is actually used

You can also extract the same data with an external library:

from lxml import etree

HTML = etree.HTML(response.content.decode())
src_list = HTML.xpath('//ul[@id="pins"]/li/a/img/@data-original')
alt_list = HTML.xpath('//ul[@id="pins"]/li/a/img/@alt')

Here are some request-header strings, used to get around anti-crawler checks:

    "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML,like Gecko) Version/5.1 Safari/534.50","Mozilla/5.0 (windows; U; windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML,"Mozilla/5.0 (windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 firefox/38.0","Mozilla/5.0 (windows NT 10.0; WOW64; TrIDent/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.3; rv:11.0) like Gecko","Mozilla/5.0 (compatible; MSIE 9.0; windows NT 6.1; TrIDent/5.0)","Mozilla/4.0 (compatible; MSIE 8.0; windows NT 6.0; TrIDent/4.0)","Mozilla/4.0 (compatible; MSIE 7.0; windows NT 6.0)","Mozilla/4.0 (compatible; MSIE 6.0; windows NT 5.1)","Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 firefox/4.0.1","Mozilla/5.0 (windows NT 6.1; rv:2.0.1) Gecko/20100101 firefox/4.0.1","Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11","Opera/9.80 (windows NT 6.1; U; en) Presto/2.8.131 Version/11.11","Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML,like Gecko) Chrome/17.0.963.56 Safari/535.11","Mozilla/4.0 (compatible; MSIE 7.0; windows NT 5.1; Maxthon 2.0)","Mozilla/4.0 (compatible; MSIE 7.0; windows NT 5.1; TencentTraveler 4.0)","Mozilla/4.0 (compatible; MSIE 7.0; windows NT 5.1)","Mozilla/4.0 (compatible; MSIE 7.0; windows NT 5.1; The World)","Mozilla/4.0 (compatible; MSIE 7.0; windows NT 5.1; TrIDent/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)","Mozilla/4.0 (compatible; MSIE 7.0; windows NT 5.1; 360SE)","Mozilla/4.0 (compatible; MSIE 7.0; windows NT 5.1; Avant browser)","Mozilla/5.0 (iPhone; U; cpu iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML,like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5","Mozilla/5.0 (iPod; U; cpu iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML,"Mozilla/5.0 (iPad; U; cpu OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML,"Mozilla/5.0 (linux; U; AndroID 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML,like Gecko) Version/4.0 Mobile Safari/533.1","MQQbrowser/26 Mozilla/5.0 (linux; U; AndroID 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML,"Opera/9.80 (AndroID 2.3.4; linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10","Mozilla/5.0 (linux; U; AndroID 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML,like Gecko) Version/4.0 Safari/534.13","Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML,like Gecko) Version/6.0.0.337 Mobile Safari/534.1+","Mozilla/5.0 (hp-tablet; linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML,like Gecko) wOSbrowser/233.70 Safari/534.6 touchPad/1.0","Mozilla/5.0 (SymbianOS/9.4; SerIEs60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML,like Gecko) browserNG/7.1.18124","Mozilla/5.0 (compatible; MSIE 9.0; windows Phone OS 7.5; TrIDent/5.0; IEMobile/9.0; HTC; Titan)","UCWEB7.0.2.37/28/999","NOKIA5700/ UCWEB7.0.2.37/28/999","Openwave/ UCWEB7.0.2.37/28/999","Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999",# iPhone 6:	"Mozilla/6.0 (iPhone; cpu iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML,like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25"

And that's what a run looks like. Alright class, pants back on, everybody. Study hard!

Now let's extend things with another trick: automatically screening images with Python. The idea behind flagging explicit images is to read every pixel of the image into memory, fill the skin-coloured regions with white and the clothing with black, work out the pixel area of the whole figure, then compute the ratio of skin to clothing; if it exceeds a predefined range, the image is flagged. That's the basic principle; a real implementation needs several supporting algorithms, and Python has a library for it: pip install Pillow porndetective. The detection code is as follows.

>>> from porndetective import PornDetective
>>> test = PornDetective("c://1.jpg")
>>> test.parse()
c://1.jpg JPEG 1600×2400: result=True message='Porn Pic!!'
<porndetective.PornDetective object at 0x0000021ACBA0EFD0>
>>> test = PornDetective("c://2.jpg")
>>> test.parse()
c://2.jpg JPEG 1620×2430: result=False message='Total skin percentage lower than 15 (12.51)'
<porndetective.PornDetective object at 0x0000021ACBA5F5E0>
>>> test.result
False

Those are the verdicts. Accuracy isn't great; strictly speaking the first image isn't even explicit. You can crawl all the pictures first, then run every one through this library: keep the hits, delete the misses, and only the choice material survives.
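For intuition, here is a toy version of the skin-ratio principle described above. It uses a common RGB skin heuristic; this is not porndetective's actual algorithm, just the general idea:

from PIL import Image

def skin_percentage(path, threshold=15.0):
    # classify each pixel with a simple RGB "skin" rule, then flag the image
    # if the skin share of all pixels exceeds the threshold (in percent)
    img = Image.open(path).convert("RGB")
    pixels = list(img.getdata())
    skin = 0
    for r, g, b in pixels:
        # a classic RGB skin heuristic (Peer et al. style)
        if (r > 95 and g > 40 and b > 20 and r > g and r > b
                and max(r, g, b) - min(r, g, b) > 15 and abs(r - g) > 15):
            skin += 1
    ratio = 100.0 * skin / len(pixels)
    return ratio, ratio > threshold

print(skin_percentage("1.jpg"))  # e.g. (12.51, False)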

The algorithm this library uses has its problems, though. Judged this way, a belly-dance photo would be flagged too. In practice machine learning gives far better accuracy; a hard-coded heuristic like this is merely passable and can't do any deeper discrimination. Whether an image is explicit can't be decided from exposed skin alone; pose, degree of exposure, clothing type and more all have to be weighed together. Still, it's good enough for the job here. If you want to sift the better material out of a mountain of images, you can write it like this.

from PIL import Image
import os
from porndetective import PornDetective

if __name__ == "__main__":
    img_dic = os.listdir("./meizitu/")
    for each in img_dic:
        img = Image.open("./meizitu/{}".format(each))
        width = img.size[0]   # width
        height = img.size[1]  # height
        img = img.resize((int(width * 0.3), int(height * 0.3)), Image.ANTIALIAS)
        img.save("image.jpg")
        test = PornDetective("./image.jpg")
        test.parse()
        if test.result == True:
            print("{} great pic, keeping it for you.".format(each))
        else:
            print("----> {} ordinary pic, removed automatically to save space; keeping it really would be a waste, mate".format(each))
            os.remove("./meizitu/" + str(each))

Next, de-duplicating the images. This code took me a good while; I had no idea at first, then it came to me. The principle: compute a CRC32 hash of each image, compare the hashes, keep the mapping from file to hash, then locate the entries and delete only the surplus copies, keeping one of each. Here is the proof-of-concept code.

import zlib, os

def Find_Repeat_File(file_path, file_type):
    Catalogue = os.listdir(file_path)
    CatalogueDict = {}  # lookup dict, so key/value pairs can be queried later
    for each in Catalogue:
        path = (file_path + each)
        if os.path.splitext(path)[1] == file_type:
            with open(path, "rb") as fp:
                crc32 = zlib.crc32(fp.read())
                # print("[*] File: {} CRC32 checksum: {}".format(path, str(crc32)))
                CatalogueDict[each] = str(crc32)
    CatalogueList = []
    for value in CatalogueDict.values():
        # pull the crc32 signatures out of the dict into CatalogueList
        CatalogueList.append(value)
    CountDict = {}
    for each in CatalogueList:
        # store each signature with its repeat count in CountDict
        CountDict[each] = CatalogueList.count(each)
    RepeatFileFeatures = []
    for key, value in CountDict.items():
        # walk the dict; any signature with a count above 1 is a duplicate
        if value > 1:
            print("[-] Signature: {} repeats: {}".format(key, value))
            RepeatFileFeatures.append(key)
    for key, value in CatalogueDict.items():
        if value == "1926471896":  # example signature from one run, for demonstration
            print("[*] Duplicate file located at: {}".format(file_path + key))

if __name__ == "__main__":
    Find_Repeat_File("D://python/", ".jpg")
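The version above only reports duplicates. A one-pass variant of the same CRC32 idea that actually deletes every copy after the first could look like this (a sketch, not the author's final tool):

import zlib, os

def dedupe_dir(file_path, file_type=".jpg"):
    seen = {}  # crc32 -> first file seen with that signature
    for name in os.listdir(file_path):
        path = os.path.join(file_path, name)
        if os.path.splitext(path)[1] != file_type:
            continue
        with open(path, "rb") as fp:
            sig = zlib.crc32(fp.read())
        if sig in seen:
            print("[-] {} duplicates {}, removing".format(name, seen[sig]))
            os.remove(path)
        else:
            seen[sig] = name

dedupe_dir("D://python/")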

Come along, little brother, let's go talk some tech. Master the craft and you'll eat meat every day.

The final spider code:

import os, re, random, urllib, argparse
from urllib import request, parse

# Build a random set of request headers
def GetUserAgent(url):
    Usrhead = ["Windows; U; Windows NT 6.1; en-us", "Windows NT 5.1; x86_64",
               "Ubuntu U; NT 18.04; x86_64", "Windows NT 10.0; WOW64",
               "X11; Ubuntu i686;", "X11; Centos x86_64;",
               "compatible; MSIE 9.0; Windows NT 8.1;", "X11; Linux i686",
               "Macintosh; U; Intel Mac OS X 10_6_8; en-us",
               "compatible; MSIE 7.0; Windows Server 6.1",
               "Macintosh; Intel Mac OS X 10.6.8; U; en",
               "compatible; MSIE 7.0; Windows NT 5.1", "iPad; CPU OS 4_3_3;"]
    UsrFox = ["Chrome/60.0.3100.0", "Auburn Browser", "Safari/522.13",
              "Chrome/80.0.1211.0", "Firefox/74.0", "Gecko/20100101 Firefox/4.0.1",
              "Presto/2.8.131 Version/11.11", "Mobile/8J2 Safari/6533.18.5",
              "Version/4.0 Safari/534.13", "wOSBrowser/233.70 Baidu Browser/534.6 TouchPad/1.0",
              "BrowserNG/7.1.18124", "Trident/4.0; SE 2.X MetaSr 1.0;", "360SE/80.1",
              "wOSBrowser/233.70", "Opera/UCWEB7.0.2.37"]
    UsrAgent = "Mozilla/5.0 (" + str(random.sample(Usrhead, 1)[0]) + ") AppleWebKit/" + str(random.randint(100, 1000)) \
               + ".36 (KHTML, like Gecko) " + str(random.sample(UsrFox, 1)[0])
    UsrRefer = str(url + "/" + "".join(random.sample("abcdef23457sdaDW", 10)))
    UserAgent = {"User-Agent": UsrAgent, "Referer": UsrRefer}
    return UserAgent

# Fetch a page's HTML source with the built-in urllib
def GetPageURL(page):
    head = GetUserAgent(page)
    req = request.Request(url=page, headers=head, method="GET")
    respon = request.urlopen(req, timeout=3)
    if respon.status == 200:
        HTML = respon.read().decode("utf-8")  # or "gbk", depending on the page
        return HTML

# Take a URL template plus a page range, splice and return the page list
def SplicingPage(page, start, end):
    url = []
    for each in range(start, end):
        temporary = page.format(each)
        url.append(temporary)
    return url

if __name__ == "__main__":
    urls = "https://www.meitulu.com/item/{}_{}.html".format(str(random.randint(1000, 20000)), "{}")
    page_list = SplicingPage(urls, 2, 100)
    for item in page_list:
        try:
            respon = GetPageURL(str(item))
            subject = re.findall('<img src="([^"]+\.jpg)"', respon, re.S)
            for each in subject:
                img_name = each.split("/")[-1]
                img_type = each.split("/")[-1].split(".")[1]
                save_name = str(random.randint(11111111, 999999999)) + "." + img_type
                print("[+] Original name: {} saved as: {} path: {}".format(img_name, save_name, each))
                # urllib.request.urlretrieve(each, None)  # header-less download, for reference
                head = GetUserAgent(str(urls))                     # roll a random set of headers
                ret = urllib.request.Request(each, headers=head)   # each = the image URL
                respons = urllib.request.urlopen(ret, timeout=10)  # open the image URL
                with open(save_name, "wb") as fp:
                    fp.write(respons.read())
        except Exception:
            # clean out images under 100 KB in the current directory
            for each in os.listdir():
                if each.split(".")[1] == "jpg":
                    if int(os.stat(each).st_size / 1024) < 100:
                        print("[-] Auto-removing {}: under 100 KB.".format(each))
                        os.remove(each)
            exit(1)

The end result: high-concurrency downloading with a clear division of labour. One part clears duplicates, one deletes files under 150 KB, one crawls. Foreman material, no doubt. All-nighter tonight.

The code above still leaves plenty to optimise. For instance, it crawls galleries at random; suppose we only want a particular subset. Let's improve it: first collect the links we need by finding all the A tags and pulling out the page titles.

from bs4 import BeautifulSoup
import requests

if __name__ == "__main__":
    get_url = []
    urls = requests.get("https://www.meitulu.com/t/youhuo/")
    soup = BeautifulSoup(urls.text, "html.parser")
    soup_ret = soup.select('div[] ul[] a')
    for each in soup_ret:
        if str(each["href"]).endswith("html"):
            get_url.append(each["href"])
    for item in get_url:
        for each in range(2, 30):
            url = item.replace(".html", "_{}.html".format(each))
            with open("url.log", "a+") as fp:
                fp.write(url + "\n")

Then just loop over the saved URLs and crawl. There's no multithreading in this one, so it will be a bit slow; a threaded variant is sketched right after the block.

from bs4 import BeautifulSoup
import requests, random

def GetUserAgent(url):
    Usrhead = ["Windows; U; Windows NT 6.1; en-us", "Windows NT 5.1; x86_64",
               "Ubuntu U; NT 18.04; x86_64", "Windows NT 10.0; WOW64",
               "X11; Ubuntu i686;", "X11; Centos x86_64;",
               "compatible; MSIE 9.0; Windows NT 8.1;", "X11; Linux i686",
               "Macintosh; U; Intel Mac OS X 10_6_8; en-us",
               "compatible; MSIE 7.0; Windows Server 6.1",
               "Macintosh; Intel Mac OS X 10.6.8; U; en",
               "compatible; MSIE 7.0; Windows NT 5.1", "iPad; CPU OS 4_3_3;"]
    UsrFox = ["Chrome/60.0.3100.0", "Auburn Browser", "Safari/522.13",
              "Chrome/80.0.1211.0", "Firefox/74.0", "Gecko/20100101 Firefox/4.0.1",
              "Presto/2.8.131 Version/11.11", "Mobile/8J2 Safari/6533.18.5",
              "Version/4.0 Safari/534.13", "wOSBrowser/233.70 Baidu Browser/534.6 TouchPad/1.0",
              "BrowserNG/7.1.18124", "Trident/4.0; SE 2.X MetaSr 1.0;", "360SE/80.1",
              "wOSBrowser/233.70", "Opera/UCWEB7.0.2.37"]
    UsrAgent = "Mozilla/5.0 (" + str(random.sample(Usrhead, 1)[0]) + ") AppleWebKit/" + str(random.randint(100, 1000)) \
               + ".36 (KHTML, like Gecko) " + str(random.sample(UsrFox, 1)[0])
    UsrRefer = str(url + "/" + "".join(random.sample("abcdef23457sdaDW", 10)))
    UserAgent = {"User-Agent": UsrAgent, "Referer": UsrRefer}
    return UserAgent

url = []
with open("url.log", "r") as fp:
    files = fp.readlines()
    for i in files:
        url.append(i.replace("\n", ""))

for i in range(0, 9999):
    aget = GetUserAgent(url[i])
    try:
        ret = requests.get(url[i], timeout=10, headers=aget)
        if ret.status_code == 200:
            soup = BeautifulSoup(ret.text, "html.parser")
            soup_ret = soup.select('div[] img')
            for x in soup_ret:
                try:
                    down = x["src"]
                    save_name = str(random.randint(11111111, 999999999)) + ".jpg"
                    print("download -> {}".format(save_name))
                    img_download = requests.get(url=down, headers=aget, stream=True)
                    with open(save_name, "wb") as fp:
                        for chunk in img_download.iter_content(chunk_size=1024):
                            fp.write(chunk)
                except Exception:
                    pass
    except Exception:
        pass
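One hedged way to add the missing threads is the standard library's concurrent.futures; fetch_page below is a hypothetical stand-in for the per-URL body above:

from concurrent.futures import ThreadPoolExecutor
import requests

def fetch_page(page_url):
    # stand-in for the per-URL work above: fetch, parse, download images
    try:
        return page_url, requests.get(page_url, timeout=10).status_code
    except Exception:
        return page_url, None

with open("url.log", "r") as fp:
    urls = [line.strip() for line in fp if line.strip()]

# 20 worker threads; requests spends its time waiting on I/O, so threads parallelise well here
with ThreadPoolExecutor(max_workers=20) as pool:
    for page_url, status in pool.map(fetch_page, urls):
        print(page_url, status)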

Crawlers for two other sites, published as well: wuso

import os, re, random, urllib, argparse, sys
from urllib import request, parse
from bs4 import BeautifulSoup

def GetUserAgent(url):
    Usrhead = ["Windows; U; Windows NT 6.1; en-us", "Windows NT 5.1; x86_64",
               "Ubuntu U; NT 18.04; x86_64", "Windows NT 10.0; WOW64",
               "X11; Ubuntu i686;", "X11; Centos x86_64;",
               "compatible; MSIE 9.0; Windows NT 8.1;", "X11; Linux i686",
               "Macintosh; U; Intel Mac OS X 10_6_8; en-us",
               "compatible; MSIE 7.0; Windows Server 6.1",
               "Macintosh; Intel Mac OS X 10.6.8; U; en",
               "compatible; MSIE 7.0; Windows NT 5.1", "iPad; CPU OS 4_3_3;"]
    UsrFox = ["Chrome/60.0.3100.0", "Auburn Browser", "Safari/522.13",
              "Chrome/80.0.1211.0", "Firefox/74.0", "Gecko/20100101 Firefox/4.0.1",
              "Presto/2.8.131 Version/11.11", "Mobile/8J2 Safari/6533.18.5",
              "Version/4.0 Safari/534.13", "wOSBrowser/233.70 Baidu Browser/534.6 TouchPad/1.0",
              "BrowserNG/7.1.18124", "Trident/4.0; SE 2.X MetaSr 1.0;", "360SE/80.1",
              "wOSBrowser/233.70", "Opera/UCWEB7.0.2.37"]
    UsrAgent = "Mozilla/5.0 (" + str(random.sample(Usrhead, 1)[0]) + ") AppleWebKit/" + str(random.randint(100, 1000)) \
               + ".36 (KHTML, like Gecko) " + str(random.sample(UsrFox, 1)[0])
    UsrRefer = url + str("/" + "".join(random.sample("abcdefghi123457sdaDW", 10)))
    UserAgent = {"User-Agent": UsrAgent, "Referer": UsrRefer}
    return UserAgent

def GetPageURL(page):
    head = GetUserAgent(page)
    req = request.Request(url=page, headers=head, method="GET")
    respon = request.urlopen(req, timeout=30)
    if respon.status == 200:
        HTML = respon.read().decode("utf-8")
        return HTML

if __name__ == "__main__":
    runt = []
    waibu = GetPageURL("https://xxx.me/forum.php?mod=forumdisplay&fid=48&typeid=114&filter=typeid&typeid=114")
    soup1 = BeautifulSoup(waibu, "html.parser")
    ret1 = soup1.select("div[id='threadlist'] ul[id='waterfall'] a")
    for x in ret1:
        runt.append(x.attrs["href"])
    for ss in runt:
        print("[+] Crawling: {}".format(ss))
        try:
            resp = []
            respon = GetPageURL(str(ss))
            soup = BeautifulSoup(respon, "html.parser")
            ret = soup.select("div[class='pct'] div[class='pcb'] td[class='t_f'] img")
            try:
                for i in ret:
                    url = "https://xxx.me/" + str(i.attrs["file"])
                    print(url)
                    resp.append(url)
            except Exception:
                pass
            for each in resp:
                try:
                    img_name = each.split("/")[-1]
                    print("down: {}".format(img_name))
                    head = GetUserAgent("https://wuso.me")
                    ret = urllib.request.Request(each, headers=head)
                    respons = urllib.request.urlopen(ret, timeout=60)
                    with open(img_name, "wb") as fp:
                        fp.write(respons.read())
                except Exception:
                    pass
        except Exception:
            pass

2.0

import os, re, random, urllib, sys
from urllib import request, parse
from bs4 import BeautifulSoup

def GetUserAgent(url):
    Usrhead = ["Windows; U; Windows NT 6.1; en-us", "Windows NT 5.1; x86_64",
               "Ubuntu U; NT 18.04; x86_64", "Windows NT 10.0; WOW64",
               "X11; Ubuntu i686;", "X11; Centos x86_64;",
               "compatible; MSIE 9.0; Windows NT 8.1;", "X11; Linux i686",
               "Macintosh; U; Intel Mac OS X 10_6_8; en-us",
               "compatible; MSIE 7.0; Windows Server 6.1",
               "Macintosh; Intel Mac OS X 10.6.8; U; en",
               "compatible; MSIE 7.0; Windows NT 5.1", "iPad; CPU OS 4_3_3;"]
    UsrFox = ["Chrome/60.0.3100.0", "Auburn Browser", "Safari/522.13",
              "Chrome/80.0.1211.0", "Firefox/74.0", "Gecko/20100101 Firefox/4.0.1",
              "Presto/2.8.131 Version/11.11", "Mobile/8J2 Safari/6533.18.5",
              "Version/4.0 Safari/534.13", "wOSBrowser/233.70 Baidu Browser/534.6 TouchPad/1.0",
              "BrowserNG/7.1.18124", "Trident/4.0; SE 2.X MetaSr 1.0;", "360SE/80.1",
              "wOSBrowser/233.70", "Opera/UCWEB7.0.2.37"]
    UsrAgent = "Mozilla/5.0 (" + str(random.sample(Usrhead, 1)[0]) + ") AppleWebKit/" + str(random.randint(100, 1000)) \
               + ".36 (KHTML, like Gecko) " + str(random.sample(UsrFox, 1)[0])
    UsrRefer = url + str("/" + "".join(random.sample("abcdefghi123457sdaDW", 10)))
    UserAgent = {"User-Agent": UsrAgent, "Referer": UsrRefer}
    return UserAgent

def GetPageURL(page):
    head = GetUserAgent(page)
    req = request.Request(url=page, headers=head, method="GET")
    respon = request.urlopen(req, timeout=30)
    if respon.status == 200:
        HTML = respon.read().decode("utf-8")
        return HTML

# Collect every thread link on the current board page
def getpage():
    # https://.me/forum.php?mod=forumdisplay&fid=48&filter=typeid&typeid=17
    waibu = GetPageURL("https://.me/forum.php?mod=forumdisplay&fid=48&filter=typeid&typeid=17")
    soup1 = BeautifulSoup(waibu, "html.parser")
    ret1 = soup1.select("div[id='threadlist'] ul[id='waterfall'] a")
    for x in ret1:
        print(x.attrs["href"])

# Extract the image paths from a thread page
def get_page_image(url):
    respon = GetPageURL(str(url))
    soup = BeautifulSoup(respon, "html.parser")
    ret = soup.select("div[class='pcb'] div[class='pattl'] div[class='mbn savephotop'] img")
    resp = []
    try:
        for i in ret:
            url = "https://.me/" + str(i.attrs["file"])
            print(url)
            resp.append(url)
    except Exception:
        pass
    return resp

# Download
if __name__ == "__main__":
    # https://.me/forum.php?mod=viewthread&tid=747730&extra=page%3D1%26filter%3Dtypeid%26typeid%3D17
    # python main.py ""
    args = sys.argv
    user = str(args[1])
    resp = get_page_image(user)
    for each in resp:
        try:
            img_name = each.split("/")[-1]
            head = GetUserAgent("https://.me")
            ret = urllib.request.Request(each, headers=head)
            respons = urllib.request.urlopen(ret, timeout=10)
            with open(img_name, "wb") as fp:
                fp.write(respons.read())
            print("down: {}".format(img_name))
        except Exception:
            pass

The second program: it opens a pool of threads, and a separate launcher wraps it in multiple processes. Crawling is extremely fast, 100% CPU utilisation.

import os, sys
import subprocess

# One model name per line.
fp = open("lis.log", "r")
aaa = fp.readlines()
for i in aaa:
    nam = i.replace("\n", "")
    cmd = "python thread.py " + nam
    os.popen(cmd)
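os.popen fires and forgets; the subprocess module (imported above but unused) keeps a handle on each child so the launcher can wait for them all. A minimal sketch:

import subprocess, sys

procs = []
with open("lis.log", "r") as fp:
    for line in fp:
        name = line.strip()
        if not name:
            continue
        # one worker process per model name; sys.executable avoids PATH surprises
        procs.append(subprocess.Popen([sys.executable, "thread.py", name]))

for p in procs:
    p.wait()  # block until every worker process has finished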

The multithreaded worker:

import requests, random
from bs4 import BeautifulSoup
import os, sys
from urllib import request, parse
import threading

def GetUserAgent(url):
    head = ["Windows; U; Windows NT 6.1; en-us", "Windows NT 6.3; x86_64",
            "Windows U; NT 6.2; x86_64", "Windows NT 6.1; WOW64",
            "X11; Linux i686;", "X11; Linux x86_64;",
            "compatible; MSIE 9.0; Windows NT 6.1;",
            "compatible; MSIE 7.0; Windows NT 6.0",
            "iPad; CPU OS 4_3_3;"]
    fox = ["Chrome/60.0.3100.0", "Chrome/59.0.2100.0",
           "wOSBrowser/233.70 Safari/534.6 TouchPad/1.0",
           "BrowserNG/7.1.18124"]
    agent = "Mozilla/5.0 (" + str(random.sample(head, 1)[0]) + ") AppleWebKit/" + str(random.randint(100, 1000)) \
            + ".36 (KHTML, like Gecko) " + str(random.sample(fox, 1)[0])
    refer = url
    UserAgent = {"User-Agent": agent, "Referer": refer}
    return UserAgent

def run(user):
    head = GetUserAgent("aHR0cHM6Ly93d3cuYW1ldGFydC5jb20v")
    ret = requests.get("aHR0cHM6Ly93d3cuYW1ldGFydC5jb20vbW9kZWxzL3t9Lw==".format(user), headers=head, timeout=3)
    scan_url = []
    if ret.status_code == 200:
        soup = BeautifulSoup(ret.text, "html.parser")
        a = soup.select("div[class='thumbs'] a")
        for each in a:
            url = "aHR0cHM6Ly93d3cuYW1ldGFydC5jb20v" + str(each["href"])
            scan_url.append(url)
    rando = random.choice(scan_url)
    print("Random pick: {}".format(rando))
    try:
        ret = requests.get(url=str(rando), headers=head, timeout=10)
        if ret.status_code == 200:
            soup = BeautifulSoup(ret.text, "html.parser")
            img = soup.select("div[class='container'] div div a")
            try:
                for each in img:
                    head = GetUserAgent(str(each["href"]))
                    down = requests.get(url=str(each["href"]), headers=head)
                    img_name = str(random.randint(100000000, 9999999999)) + ".jpg"
                    print("[+] Parsed image: {} saved as: {}".format(each["href"], img_name))
                    with open(img_name, "wb") as fp:
                        fp.write(down.content)
            except Exception:
                pass
    except Exception:
        exit(1)

if __name__ == "__main__":
    args = sys.argv
    user = str(args[1])
    try:
        os.mkdir(user)
        os.chdir("D://python/test/" + user)
        for item in range(100):
            t = threading.Thread(target=run, args=(user,))
            t.start()
    except FileExistsError:
        exit(0)

Twenty processes, each hauling a hundred threads: about 1,500 concurrent requests per second. And with the dedup program scanning non-stop, no image is stored twice and only the highest-quality copy survives. Funny thing: once you have this many girl pics, none of the girls look good any more, hahaha.


After all that crawling we're left with tens of thousands of images. But what if we want the photo sets of one particular model? Enter the AI face-recognition legion: with a bit of simple machine learning we can recognise a specific face and filter for exactly the shots we want.

import cv2
import numpy as np

def display_Face(img_path):
    img = cv2.imread(img_path)                                                   # read the image
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)                                 # convert it to grayscale
    face_cascade = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")  # load the cascade classifier model
    face_cascade.load("haarcascade_frontalface_default.xml")
    faces = face_cascade.detectMultiScale(gray, 1.3, 5)
    for (x, y, w, h) in faces:
        # draw the bounding box on the original image (blue, width 3)
        img = cv2.rectangle(img, (x, y), (x + w, y + h), (255, 0, 0), 3)
    cv2.namedWindow("img", 0)
    cv2.resizeWindow("img", 300, 400)
    cv2.imshow('img', img)
    cv2.waitKey()

def Return_Face(img_path):
    img = cv2.imread(img_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    face_cascade = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.2, minNeighbors=5)
    if (len(faces) == 0):
        return None, None
    (x, y, w, h) = faces[0]
    return gray[y:y + h, x:x + w], faces[0]  # note: height slices rows, width slices columns

ret = Return_Face("./meizi/172909315.jpg")
print(ret)
display_Face("./meizi/172909315.jpg")

import cv2, os
import numpy as np

def Return_Face(img_path):
    img = cv2.imread(img_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    face_cascade = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.2, minNeighbors=5)
    if (len(faces) == 0):
        return None, None
    (x, y, w, h) = faces[0]
    return gray[y:y + h, x:x + w], faces[0]

# Load the images: read an ORL-style face database and prepare the training data
def LoadImages(data):
    images = []
    names = []
    labels = []
    label = 0
    # walk every sub-directory
    for subdir in os.listdir(data):
        subpath = os.path.join(data, subdir)
        # print('path', subpath)
        # make sure it really is a directory
        if os.path.isdir(subpath):
            # each directory holds many photos of one person
            names.append(subdir)
            # walk the image files inside it
            for filename in os.listdir(subpath):
                imgpath = os.path.join(subpath, filename)
                img = cv2.imread(imgpath, cv2.IMREAD_COLOR)
                gray_img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
                # cv2.imshow('1', img)
                # cv2.waitKey(0)
                images.append(gray_img)
                labels.append(label)
            label += 1
    images = np.asarray(images)
    # names = np.asarray(names)
    labels = np.asarray(labels)
    return images, labels, names

images, labels, names = LoadImages("./")
face_recognizer = cv2.face.LBPHFaceRecognizer_create()
# create the LBPH recogniser and start training
face_recognizer.train(images, labels)
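The snippet above stops right after train(). A hedged sketch of the matching step, reusing Return_Face, face_recognizer and names from the block above (the image path is just an example):

# assumes face_recognizer has been trained and names populated as above
face, rect = Return_Face("./meizi/172909315.jpg")
if face is not None:
    label, distance = face_recognizer.predict(face)  # smaller distance = closer match
    print("Predicted person: {} (distance {:.1f})".format(names[label], distance))
else:
    print("No face detected.")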

Other crawlers collected from around the web, for reference.
1

# -*- coding: utf-8 -*-
import sys, requests
from bs4 import BeautifulSoup
sys.path.append("/Python")
import conf.mysql_db as mysqldb

image_count = 1

# Grab the info of every photo set in the gallery
def get_photo_info(url, layout_tablename):
    global Photonames
    HTML = get_HTML(url)
    # HTML = fread('ttb.html')
    soup = BeautifulSoup(HTML, "lxml")
    db = mysqldb.Database()
    icount = 1
    for ul in soup.find_all(class_='ul960c'):
        for li in ul:
            if (str(li).strip()):
                Photoname = li.span.string
                PhotoUrl = li.img['src']
                imageUrl = 'http://www.quantuwang.co' + li.a['href']
                print('Set ' + str(icount) + ': ' + Photoname + ' ' + PhotoUrl + ' ' + imageUrl)
                sql = "insert into " + layout_tablename + "(picname,girlname,picpath,flodername) values('%s','%s'," \
                      "'%s','%s')" % (imageUrl, Photoname, PhotoUrl, Photoname)
                db.execute(sql)
                icount = icount + 1
    db.close()
    return True

# Find every image inside a set and save its info
def get_images(image_tablename, pic_nums, pic_Title, url, layout_count):
    global image_count
    db = mysqldb.Database()
    try:
        for i in range(1, int(pic_nums)):
            pic_url = url[:-5] + str(i) + '.jpg'
            sql = "insert into " + image_tablename + "(ID,imageID,flodername,imagepath) " \
                  "values (" + str(i) + "," + str(image_count) + ",'" + pic_Title + "','" + pic_url + "')"
            db.execute(sql)
            print('Set ' + str(layout_count) + ', image ' + str(image_count) + ', no. ' + str(i) + ': ' + pic_Title + ' url:' + pic_url)
            image_count = image_count + 1
    except Exception as e:
        print('Error', e)
    db.close()

# Get the per-page links from the gallery index
def get_image_pages(url):
    HTML = get_HTML(url)
    soup = BeautifulSoup(HTML, "lxml")
    # print(HTML)
    image_pages = []
    image_pages.append(url)
    try:
        for ul in soup.find_all(class_='c_page'):
            for li in ul.find_all('a'):
                image_pages.append('http://www.quantuwang.co/' + li.get('href'))
    except Exception as e:
        print('Error', e)
    return len(image_pages)

# Fetch a page: pass a url, get back the HTML source
def get_HTML(url):
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        # 'Accept-Encoding': 'gzip,deflate',
        # 'Accept-Language': 'zh-CN,zh;q=0.9',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36',
    }
    resp = requests.get(url, headers=headers)
    resp.encoding = 'utf-8'
    HTML = resp.text
    # fwrite(HTML)
    return HTML

# Build the url for page i
def handle_url(i, url):
    if i == 1:
        return url
    else:
        url = url[:-5] + "_" + format(i) + ".html"
        return url

def main():
    global image_count
    # image_count = 1391
    url = 'http://www.quantuwang.co/t/f4543e3a7d545391.html'
    layoyt_name = '糯美子Mini'
    layout_tablename = 'pc_dic_' + 'nuomeizi'
    image_tablename = 'po_' + 'nuomeizi'
    # copy the table structures
    db = mysqldb.Database()
    try:
        sql = "create table if not exists " + layout_tablename + "(like pc_dic_toxic)"
        db.execute(sql)
        print('Created table: ' + layout_tablename)
        sql = "create table if not exists " + image_tablename + "(like po_toxic)"
        db.execute(sql)
        print('Created table: ' + image_tablename)
    except Exception as e:
        print('Error', e)
    db.close()
    # step 1: scrape the listing page
    get_photo_info(url, layout_tablename)
    # step 2: find every image in each set, insert into the database
    layout_count = 1
    db = mysqldb.Database()
    sql = 'select * from ' + layout_tablename + ' where ID>0'
    results = db.fetch_all(sql)
    for row in results:
        # per set: the number of image pages
        imgage_nums = get_image_pages(row['picname']) + 1
        get_images(image_tablename, imgage_nums, row['flodername'], row['picpath'], layout_count)
        layout_count = layout_count + 1
    db.close()
    # update the master table
    db = mysqldb.Database()
    try:
        sql = "select max(imageID) as maxcount from " + image_tablename
        results = db.fetch_one(sql)
        sql = "insert into pc_dic_lanvshen(Beautyname,MinID,MaxID,tablename,Indexname,IndexType) values ('%s',%d,%d," \
              "'%s','%s',%d)" % (layoyt_name, 1, int(results['maxcount']), image_tablename, layout_tablename, 1)
        db.execute(sql)
        print('Master table updated: ' + layout_tablename + ' ' + image_tablename)
    except Exception as e:
        print('Error', e)
    db.close()

if __name__ == '__main__':
    main()

2

#!/usr/local/Cellar/python/3.7.3/bin
# -*- coding: utf-8 -*-
# https://www.meitulu.com
import sys, requests, time, re, random
from bs4 import BeautifulSoup
sys.path.append("/Python")
import conf.mysql_db as mysqldb

album_count = 1
image_count = 1

# Grab the info of every photo set in the gallery
def get_photo_info(url, layout_tablename):
    global album_count
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Encoding': 'gzip,deflate,br',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'
    }
    req = requests.get(url, headers=headers)
    req.encoding = 'utf-8'
    # print(req.text)
    soup = BeautifulSoup(req.text, "lxml")
    db = mysqldb.Database()
    for ul in soup.find_all(class_='img'):
        for li in ul:
            if (str(li).strip()):
                Albumname = li.img['alt']
                AlbumNums = re.findall(r"\d+\.?\d*", li.p.string)[0]
                AlbumUrl = li.a['href']
                PhotoUrl = li.img['src']
                print('Set ' + str(album_count) + ': ' + Albumname + ' ' + AlbumUrl + ' ' + PhotoUrl)
                sql = "insert into " + layout_tablename + "(picname,girlname,picpath,flodername) values('%s','%s'," \
                      "'%s','%s')" % (AlbumUrl, Albumname, AlbumNums, Albumname)
                db.execute(sql)
                album_count = album_count + 1
    db.close()
    return True

# Save every image's info
def get_images(image_tablename, image_nums, flodername, image_url, albumID):
    global image_count
    db = mysqldb.Database()
    for i in range(1, int(image_nums) + 1):
        image_path = image_url[:-6] + '/' + str(i) + '.jpg'
        sql = "insert into " + image_tablename + "(imageID,flodername,imagepath,ID) values('%s','%s'," \
              "'%s','%s')" % (image_count, flodername, image_path, i)
        db.execute(sql)
        print('Set ' + str(albumID) + ', image ' + str(image_count) + ', no. ' + str(i) + ': ' + flodername + ' url:' + image_path)
        image_count = image_count + 1
    db.close()

# Check whether a page exists
def get_HTML_status(url):
    req = requests.get(url).status_code
    if (req == 200):
        return True
    else:
        return False

def main():
    global album_count
    global image_count
    # image_count = 1391
    url = 'https://www.meitulu.com/t/dingziku/'
    album_name = '丁字裤美女'
    album_tablename = 'pc_dic_' + 'dingziku'
    image_tablename = 'po_' + 'dingziku'
    # copy the table structures
    db = mysqldb.Database()
    try:
        sql = "create table if not exists " + album_tablename + "(like pc_dic_toxic)"
        db.execute(sql)
        print('Created table: ' + album_tablename)
        sql = "create table if not exists " + image_tablename + "(like po_toxic)"
        db.execute(sql)
        print('Created table: ' + image_tablename)
    except Exception as e:
        print('Error', e)
    db.close()
    # step 1: scrape the listing pages
    get_photo_info(url, album_tablename)
    for i in range(2, 100):
        urls = url + str(i) + '.html'
        # urls = url + str(i) + '.html'
        if (get_HTML_status(urls)):
            get_photo_info(urls, album_tablename)
            time.sleep(random.randint(1, 3))
        else:
            break
    # step 2: find every image in each set, insert into the database
    db = mysqldb.Database()
    sql = 'select * from ' + album_tablename + ' where ID>0'
    results = db.fetch_all(sql)
    for row in results:
        get_images(image_tablename, row['picpath'], row['flodername'], row['imageID'], row['ID'])
    db.close()
    # update the master table
    db = mysqldb.Database()
    try:
        sql = "select max(imageID) as maxcount from " + image_tablename
        results = db.fetch_one(sql)
        sql = "insert into pc_dic_lanvshen(Beautyname,MinID,MaxID,tablename,Indexname,IndexType) values ('%s',%d,%d," \
              "'%s','%s',%d)" % (album_name, 1, int(results['maxcount']), image_tablename, album_tablename, 1)
        db.execute(sql)
        print('Master table updated: ' + album_tablename + ' ' + image_tablename)
    except Exception as e:
        print('Error', e)
    db.close()

if __name__ == '__main__':
    main()

3

#!/usr/local/Cellar/python/3.7.3/bin
# -*- coding: utf-8 -*-
# https://www.lanvshen.com
import sys, requests, time, re, random
from bs4 import BeautifulSoup
sys.path.append("/Python")
import conf.mysql_db as mysqldb

layout_count = 1
image_count = 1

# Find the info of every photo set
def get_layout(url, layout_tablename):
    global layout_count
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'
    }
    req = requests.get(url, headers=headers)
    req.encoding = 'utf-8'
    soup = BeautifulSoup(req.text, "lxml")
    db = mysqldb.Database()
    try:
        for ul1 in soup.find_all(class_='hezi'):
            for ul2 in ul1:
                if (str(ul2).strip()):
                    for li in ul2:
                        if (str(li).strip()):
                            layout_url = li.a['href']
                            cover_url = li.img['src']
                            layout_nums = re.findall('(\d+)', li.span.string)[0]
                            layout_name = li.find_all("p", class_="biaoti")[0].a.string
                            print('Set ' + str(layout_count) + ': ' + layout_name + " url:" + layout_url)
                            # print('Album: ' + layout_name + ' images: ' + str(layout_nums) + ' link: ' + cover_url)
                            sql = "insert into " + layout_tablename + "(ID,picname,girlname,picpath,flodername) values (" + \
                                  str(layout_count) + ",'" + layout_url + "','" + layout_name + "','" + cover_url + "'," + str(layout_nums) + ",'" + layout_name + "')"
                            db.execute(sql)
                            layout_count = layout_count + 1
    except Exception as e:
        print('Error', e)
    db.close()

# Find every image inside a set
def get_images(image_tablename, pic_nums, pic_Title, url):
    global image_count
    global layout_count
    url_num = re.findall('(\d+)', url)[0]
    db = mysqldb.Database()
    for i in range(1, int(pic_nums)):
        pic_url = 'https://img.hywly.com/a/1/' + url_num + '/' + str(i) + '.jpg'
        sql = "insert into " + image_tablename + "(ID,imageID,flodername,imagepath) " \
              "values (" + str(i) + "," + str(image_count) + ",'" + pic_Title + "','" + pic_url + "')"
        db.execute(sql)
        print('Set ' + str(layout_count) + ', image ' + str(image_count) + ', no. ' + str(i) + ': ' + pic_Title + ' url:' + pic_url)
        image_count = image_count + 1
    db.close()

# Check whether a page exists
def get_HTML_status(url):
    req = requests.get(url).status_code
    if (req == 200):
        return True
    else:
        return False

def main():
    global layout_count
    url = 'https://www.lanvshen.com/s/16/'
    layoyt_name = '蕾丝美女'
    layout_tablename = 'pc_dic_' + 'leisi'
    image_tablename = 'po_' + 'leisi'
    # copy the table structures
    db = mysqldb.Database()
    try:
        sql = "create table if not exists " + layout_tablename + "(like pc_dic_toxic)"
        db.execute(sql)
        print('Created table: ' + layout_tablename)
        sql = "create table if not exists " + image_tablename + "(like po_toxic)"
        db.execute(sql)
        print('Created table: ' + image_tablename)
    except Exception as e:
        print('Error', e)
    db.close()
    # find every photo set, insert into the database
    get_layout(url, layout_tablename)
    for i in range(1, 100):
        urls = url + 'index_' + str(i) + '.html'
        # urls = url + str(i) + '.html'
        if (get_HTML_status(urls)):
            get_layout(urls, layout_tablename)
            time.sleep(random.randint(1, 3))
        else:
            break
    # find every image in each set, insert into the database
    layout_count = 1
    db = mysqldb.Database()
    sql = 'select * from ' + layout_tablename + ' order by ID'
    results = db.fetch_all(sql)
    for row in results:
        get_images(image_tablename, row['flodername'], row['girlname'], row['picname'])
        layout_count = layout_count + 1
    db.close()
    # update the master table
    db = mysqldb.Database()
    try:
        sql = "select max(imageID) as maxcount from " + image_tablename
        results = db.fetch_one(sql)
        sql = "insert into pc_dic_lanvshen(Beautyname,MinID,MaxID,tablename,Indexname,IndexType) values ('%s',%d,%d," \
              "'%s','%s',%d)" % (layoyt_name, 1, int(results['maxcount']), image_tablename, layout_tablename, 1)
        db.execute(sql)
        print('Master table updated: ' + layout_tablename + ' ' + image_tablename)
    except Exception as e:
        print('Error', e)
    db.close()

if __name__ == '__main__':
    main()

Ahem. Quick, Python, help me up; I can still learn to drive an excavator. To be continued...
