Python learning notes

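I. Image CAPTCHAs

What gets in the way of a crawler is sometimes the image CAPTCHA shown when logging in or requesting certain data. This section therefore covers a technique for turning images into text.

Converting an image into text is generally called Optical Character Recognition, abbreviated OCR.

There are not many libraries that implement OCR, especially open-source ones. The field has real technical barriers (it requires large amounts of data plus algorithm, machine learning, and deep learning expertise), and a good implementation has high commercial value, so few are released as open source.

Here is one good open-source image recognition library: Tesseract.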

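Tesseract is an OCR (Optical Character Recognition) engine that converts images to text; its development has been sponsored by Google.

Tesseract is widely regarded as one of the best and most accurate open-source OCR libraries.

Tesseract has a high recognition rate and is also very flexible: it can be trained to recognize additional fonts.

To call Tesseract from Python, install the pytesseract wrapper (the Tesseract program itself must be installed separately):

pip install pytesseract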

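1. After installation, set the environment variables if you want to use Tesseract from the command line.

On Mac and Linux this is already done by default during installation.

On Windows, add the directory that contains tesseract.exe to the PATH environment variable, for example: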

C:\Program Files\Tesseract-OCR

2. One more environment variable is needed: the path to the trained data files.

Add a variable like this one:

TESSDATA_PREFIX=D:\Tesseract-OCR\tessdata

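3. Open cmd and run the command below to check the version. If it runs normally, the installation succeeded.

tesseract --version

4. Recognize an image from the command line with: tesseract <image path> <output file base name>. The command below writes the recognized text to a.txt.

tesseract demo.png a

5. Recognize an image with pytesseract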

import pytesseract
from PIL import Image
pytesseract.pytesseract.tesseract_cmd = r'D:\Tesseract-OCR\tesseract.exe'
tessdata_dir_config = r'--tessdata-dir "D:\Tesseract-OCR\tessdata"'
image = Image.open('demo.png')
print(pytesseract.image_to_string(image, lang='eng',
config=tessdata_dir_config))

6. Using pytesseract on image CAPTCHAs

CAPTCHA URL:
https://passport.lagou.com/vcode/create?from=register&refresh=1513081451891
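
The notes give only the CAPTCHA URL for this step. A minimal sketch of one common approach follows, assuming the CAPTCHA image has already been saved to a local file: convert it to grayscale, binarize it to suppress background noise, then pass it to pytesseract. The names binarize and recognize_captcha, the threshold of 140, and the --psm 7 setting (treat the image as a single line of text) are illustrative assumptions, not part of the original notes, and real CAPTCHAs often need more cleanup than this.

```python
def binarize(pixels, threshold=140):
    """Map grayscale values (0-255) to pure black (0) or pure white (255).

    `pixels` is a list of rows; the threshold value is an assumption and
    usually has to be tuned per CAPTCHA style.
    """
    return [[255 if p > threshold else 0 for p in row] for row in pixels]


def recognize_captcha(path, threshold=140):
    """Hypothetical helper: OCR a saved CAPTCHA image after binarizing it."""
    # Imported here so binarize() above can be tried without these packages.
    from PIL import Image
    import pytesseract

    gray = Image.open(path).convert('L')  # 'L' = 8-bit grayscale
    # Same thresholding as binarize(), applied through Pillow.
    cleaned = gray.point(lambda p: 255 if p > threshold else 0)
    # --psm 7 asks Tesseract to treat the image as a single text line.
    text = pytesseract.image_to_string(cleaned, lang='eng', config='--psm 7')
    return text.strip()
```

Calling recognize_captcha('captcha.png') would return the recognized string for a file of that (hypothetical) name; the result should be treated as a guess and validated against the site's response.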

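II. Multithreading

1. Basic introduction

Many real-world activities happen at the same time: when driving, hands and feet work together to control the car, and singing and dancing can likewise go on simultaneously.

Simulating multiple tasks in a program. In the version below, sing() and dance() are called one after the other, so the dancing starts only after the singing has finished: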

import time
def sing():
    for i in range(3):
        print("Singing...%d" % i)
        time.sleep(1)
def dance():
    for i in range(3):
        print("Dancing...%d" % i)
        time.sleep(1)
if __name__ == '__main__':
    sing()
    dance()
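
For comparison, here is a minimal sketch of the same two tasks running concurrently with threading.Thread. The log list, the delay parameter, and the run_concurrently helper are assumptions added so the behavior can be observed and timed; they are not part of the original notes.

```python
import threading
import time


def sing(log, delay):
    for i in range(3):
        log.append("singing %d" % i)
        time.sleep(delay)


def dance(log, delay):
    for i in range(3):
        log.append("dancing %d" % i)
        time.sleep(delay)


def run_concurrently(delay=0.1):
    """Run sing() and dance() in two threads; return (log, elapsed seconds)."""
    log = []
    start = time.perf_counter()
    threads = [
        threading.Thread(target=sing, args=(log, delay)),
        threading.Thread(target=dance, args=(log, delay)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for both tasks before measuring the time
    return log, time.perf_counter() - start


if __name__ == '__main__':
    log, elapsed = run_concurrently()
    print(log)
    print("elapsed: %.2fs" % elapsed)
```

Because both threads spend their time sleeping, the total run time is close to that of one task rather than the sum of the two.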

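2. How the main thread and child threads relate

By default, the main thread waits for its child threads to finish before the program exits.
join(): blocks the calling thread until the child thread finishes, after which the caller continues.
setDaemon(True): marks the thread as a daemon thread; the program does not wait for daemon threads when it exits. Since Python 3.10, setDaemon() is deprecated in favor of the daemon attribute.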

import threading
import time
def demo():
    # Runs in a child thread
    print("hello girls")
    time.sleep(1)
    
if __name__ == "__main__":
    for i in range(5):
        t = threading.Thread(target=demo)
        t.start()
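
The example above starts the threads but never calls join() or marks a thread as a daemon. A minimal sketch of both follows; the worker, run_and_wait, and start_background names are illustrative assumptions.

```python
import threading
import time


def worker(results, n):
    time.sleep(0.05)
    results.append(n)


def run_and_wait(count=5):
    """Start `count` child threads and join() each one before returning."""
    results = []
    threads = [threading.Thread(target=worker, args=(results, n))
               for n in range(count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # block here until this child thread has finished
    return results  # every child has finished by this point


def start_background():
    """Start a daemon thread; the program will not wait for it at exit."""
    t = threading.Thread(target=time.sleep, args=(0.05,), daemon=True)
    t.start()
    return t
```

Passing daemon=True to the constructor (or setting t.daemon = True before start()) has the same effect as the older setDaemon(True) call.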

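3. Checking the number of threads

threading.enumerate() returns a list of the Thread objects that are currently alive; len(threading.enumerate()) gives the current number of threads.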
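
A small sketch of counting threads this way. The count_extra_threads helper and the use of an Event to keep the workers alive while they are counted are assumptions made for the illustration.

```python
import threading


def count_extra_threads(n=3):
    """Start n threads and report how many more threads are alive than before."""
    release = threading.Event()
    before = len(threading.enumerate())  # normally just the main thread
    threads = [threading.Thread(target=release.wait) for _ in range(n)]
    for t in threads:
        t.start()
    during = len(threading.enumerate())  # now includes the n waiting threads
    release.set()  # let the workers return
    for t in threads:
        t.join()
    return during - before


if __name__ == '__main__':
    print(count_extra_threads(3))
```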

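4. Verifying when a child thread is created and run

Calling Thread(...) only creates a Thread object; no thread is created at that point.

The thread is created and starts running only when the start() method of that instance is called.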
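
One way to check this is to inspect the Thread object before and after start(). This is a sketch; the lifecycle helper and the keys of the returned dictionary are made up for the illustration.

```python
import threading


def lifecycle():
    """Record a thread's state before start(), after start(), and after join()."""
    release = threading.Event()
    t = threading.Thread(target=release.wait)
    states = {
        "ident_before_start": t.ident,       # no thread exists yet
        "alive_before_start": t.is_alive(),
    }
    t.start()                                # the thread is created here
    states["alive_after_start"] = t.is_alive()
    release.set()                            # let the thread's target return
    t.join()
    states["alive_after_join"] = t.is_alive()
    return states


if __name__ == '__main__':
    print(lifecycle())
```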

Creating a thread by subclassing Thread:
import threading
import time
class A(threading.Thread):
    
    def __init__(self,name):
        super().__init__(name=name)
    
    def run(self):
        for i in range(5):
            print(i)
if __name__ == "__main__":
    t = A('test_name')    
    t.start()

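5. Communication between threads (sharing global variables across threads)

Whether a function needs global in order to modify a global variable depends on whether it rebinds the name. If the function makes the name point to a different object (an assignment), global is required. If it only changes the data inside the object the name already points to (for example, appending to a list), global is not required.

Threads in the same process share global variables.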
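
A minimal sketch of both cases, run from child threads to show that the changes are visible afterwards in the main thread. The variable and function names are illustrative assumptions.

```python
import threading

counter = 0
items = []


def rebind():
    global counter  # required: the assignment below rebinds the global name
    counter += 1


def mutate():
    items.append(1)  # no global needed: the list object is changed in place


def run_in_thread(func):
    t = threading.Thread(target=func)
    t.start()
    t.join()


def demo():
    """Modify both globals from child threads and return what the caller sees."""
    run_in_thread(rebind)
    run_in_thread(mutate)
    return counter, list(items)
```

Without the global statement, counter += 1 inside rebind() would raise UnboundLocalError, because the assignment would make counter a local name.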

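6. Mutexes and deadlock

Mutex: when several threads modify the same shared data at almost the same time, access has to be synchronized. A thread that wants to change the shared data first locks it. The resource is then in the "locked" state and no other thread can change it until that thread releases it. Once the resource is back in the "unlocked" state, another thread can lock it in turn.

A mutex ensures that only one thread performs a write at a time, which keeps the data correct under multithreading.

Create a lock:
mutex = threading.Lock()
Acquire it:
mutex.acquire()
Release it:
mutex.release()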
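
As a sketch of how these calls are typically used, the example below protects a shared counter. It uses the with statement, which calls acquire() on entry and release() on exit; the locked_count helper and its parameters are assumptions made for the illustration.

```python
import threading


def locked_count(num_threads=4, increments=10000):
    """Increment a shared counter from several threads under a mutex."""
    total = 0
    mutex = threading.Lock()

    def add():
        nonlocal total
        for _ in range(increments):
            with mutex:  # acquire() on entry, release() on exit, even on error
                total += 1

    threads = [threading.Thread(target=add) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


if __name__ == '__main__':
    print(locked_count())
```

With the lock in place, the result is always num_threads * increments, since only one thread at a time can execute the read-modify-write on total.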

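Deadlock: when threads share several resources, a deadlock occurs if two threads each hold part of the resources and each waits for the resource held by the other. The program below deadlocks on purpose: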

import threading
import time
class MyThread1(threading.Thread):
    def run(self):
        # Lock mutexA
        mutexA.acquire()
        # Hold mutexA and sleep for 1 second so the other thread can lock mutexB
        print(self.name+'----do1---up----')
        time.sleep(1)
        # Blocks here, because mutexB is already held by the other thread
        mutexB.acquire()
        print(self.name+'----do1---down----')
        mutexB.release()
        # Unlock mutexA
        mutexA.release()
class MyThread2(threading.Thread):
    def run(self):
        # Lock mutexB
        mutexB.acquire()
        # Hold mutexB and sleep for 1 second so the other thread can lock mutexA
        print(self.name+'----do2---up----')
        time.sleep(1)
        # Blocks here, because mutexA is already held by the other thread
        mutexA.acquire()
        print(self.name+'----do2---down----')
        mutexA.release()
        # Unlock mutexB
        mutexB.release()
mutexA = threading.Lock()
mutexB = threading.Lock()
if __name__ == '__main__':
    t1 = MyThread1()
    t2 = MyThread2()
    t1.start()
    t2.start()

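7. Avoiding deadlock

Design the program so that deadlock cannot arise where possible, for example by having every thread acquire the locks in the same order.

Add a timeout when acquiring a lock, and so on.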
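
A minimal sketch of the timeout approach, assuming two locks as in the deadlock example. Lock.acquire() accepts a timeout argument and returns False if the lock could not be obtained in time; the try_both helper and the timeout values are illustrative assumptions.

```python
import threading

mutexA = threading.Lock()
mutexB = threading.Lock()


def try_both(first, second, timeout=0.1):
    """Acquire two locks, giving up instead of blocking forever."""
    with first:
        if not second.acquire(timeout=timeout):
            # Could not get the second lock in time: back off.
            # Leaving the with block releases the first lock as well.
            return False
        try:
            return True  # both locks are held: the critical section goes here
        finally:
            second.release()
```

A caller that receives False can wait briefly and retry, which breaks the circular wait that causes the deadlock.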

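III. Producer-consumer model

(1) Queue and threads

In threaded code, locking around access to global variables is routine.

If what you need is to store data in a queue, Python has a built-in thread-safe module for that: queue.

The queue module provides synchronized, thread-safe queue classes, including the FIFO (first in, first out) queue Queue and the LIFO (last in, first out) queue LifoQueue.

These queues implement the locking primitives internally (each operation can be thought of as atomic: it either does not happen or completes fully), so they can be used directly from multiple threads.

Queues can therefore be used to synchronize threads.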

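Queue methods used by producer and consumer threads:

Queue(maxsize): creates a first-in, first-out queue that holds at most maxsize items (a maxsize of zero or less means unbounded).

empty(): returns whether the queue is empty.

full(): returns whether the queue is full.

get(): removes and returns an item from the queue. For a FIFO Queue this is the item that was put in earliest.

put(): adds an item to the queue.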
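
The methods above can be tried out in a few lines. This sketch also contrasts Queue with LifoQueue; the queue_demo helper and the keys of the returned dictionary are assumptions made for the illustration.

```python
from queue import Queue, LifoQueue


def queue_demo():
    """Exercise empty(), full(), put() and get() on a FIFO and a LIFO queue."""
    q = Queue(maxsize=3)
    state = {"empty_at_start": q.empty()}
    for n in (1, 2, 3):
        q.put(n)
    state["full_after_3_puts"] = q.full()
    state["fifo_order"] = [q.get() for _ in range(3)]  # oldest item first

    stack = LifoQueue()
    for n in (1, 2, 3):
        stack.put(n)
    state["lifo_order"] = [stack.get() for _ in range(3)]  # newest item first
    return state


if __name__ == '__main__':
    print(queue_demo())
```

By default put() blocks when the queue is full and get() blocks when it is empty, which is what lets producers and consumers pace each other.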

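(2) Producers and consumers

The producer-consumer pattern is common in multithreaded development.

It helps code reach high cohesion and low coupling, makes threads easier to manage, and gives each part of the program a clearer job.

Producer threads are dedicated to producing data and storing it in a container (an intermediate variable).

Consumer threads then take data out of that container and consume it.

Example: downloading meme images the ordinary (single-threaded) way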

import requests
from lxml import etree
from urllib import request
import os
import re
def parse_page(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
    }
    # Send the browser-like User-Agent with the request
    response = requests.get(url, headers=headers)
    text = response.text
    html = etree.HTML(text)
    imgs = html.xpath("//div[@class='ui segmentimghover']//img[@class='ui image lazy']")
    for img in imgs:
        img_url = img.xpath("@data-original")[0]
        suffix = os.path.splitext(img_url)[1]
        alt = img.xpath(".//@alt")[0]
        # Strip punctuation that is awkward or invalid in file names
        alt = re.sub(r'[，。？?,/\\·<>]', '', alt)
        img_name = alt + suffix
        # img_name already ends with the file extension
        request.urlretrieve(img_url, 'images/{}'.format(img_name))
        print('Downloading {}'.format(img_name))
def main():
    # urlretrieve does not create the target directory
    os.makedirs('images', exist_ok=True)
    for i in range(1, 101):
        url = "https://www.fabiaoqing.com/biaoqing/lists/page/{}.html".format(i)
        parse_page(url)
        break  # stop after the first page while testing
if __name__ == '__main__':
    main()

Downloading meme images with multiple threads using the producer-consumer pattern

import threading
import requests
from lxml import etree
from urllib import request
import os
import re
from queue import Queue
class Producer(threading.Thread):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
    }
    def __init__(self, page_queue, img_queue, *args, **kwargs):
        super(Producer, self).__init__(*args, **kwargs)
        self.page_queue = page_queue
        self.img_queue = img_queue
    def run(self):
        # Left unimplemented in these notes
        pass
    def parse_page(self, url):
        # Left unimplemented in these notes
        pass
class Consumer(threading.Thread):
    def __init__(self, page_queue, img_queue, *args, **kwargs):
        super(Consumer, self).__init__(*args, **kwargs)
        self.page_queue = page_queue
        self.img_queue = img_queue
    def run(self):
        # Left unimplemented in these notes
        pass
def main():
    page_queue = Queue(100)
    img_queue = Queue(500)
    for x in range(1, 101):
        url = "https://www.fabiaoqing.com/biaoqing/lists/page/%d.html" % x
        page_queue.put(url)
    for x in range(5):
        t = Producer(page_queue, img_queue)
        t.start()
    for x in range(5):
        t = Consumer(page_queue, img_queue)
        t.start()
if __name__ == '__main__':
    main()
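
The skeleton above leaves the run() methods empty. The sketch below shows one way the two-queue flow could work end to end. It is not the original author's implementation: it uses plain functions instead of Thread subclasses, replaces the network requests with made-up page and image names so it runs offline, and shuts the threads down with a SENTINEL marker, which is an assumption of this sketch.

```python
import threading
from queue import Queue

SENTINEL = None  # hypothetical "no more work" marker


def producer(page_queue, img_queue):
    """Take page numbers from page_queue and put image names on img_queue."""
    while True:
        page = page_queue.get()
        if page is SENTINEL:
            break
        # Stand-in for parsing a page: pretend each page lists two images.
        for k in range(2):
            img_queue.put("page%d-img%d" % (page, k))


def consumer(img_queue, downloaded, lock):
    """Take image names from img_queue and 'download' them."""
    while True:
        img = img_queue.get()
        if img is SENTINEL:
            break
        with lock:
            downloaded.append(img)  # stand-in for saving the image to disk


def run(pages=10, n_producers=3, n_consumers=3):
    page_queue, img_queue = Queue(), Queue(50)
    downloaded, lock = [], threading.Lock()
    for p in range(pages):
        page_queue.put(p)
    for _ in range(n_producers):
        page_queue.put(SENTINEL)  # one stop marker per producer
    producers = [threading.Thread(target=producer, args=(page_queue, img_queue))
                 for _ in range(n_producers)]
    consumers = [threading.Thread(target=consumer, args=(img_queue, downloaded, lock))
                 for _ in range(n_consumers)]
    for t in producers + consumers:
        t.start()
    for t in producers:
        t.join()
    for _ in range(n_consumers):
        img_queue.put(SENTINEL)  # sent only after every producer has finished
    for t in consumers:
        t.join()
    return downloaded


if __name__ == '__main__':
    print(len(run()))
```

Sending the consumer stop markers only after all producers have joined guarantees that every image name is queued ahead of them, so no work is dropped.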


Original article: http://outofmemory.cn/langs/571201.html
