[Python] A roundup of commonly used packages for data analysis, modeling, and AI

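- I. Reading files of various formats in Python
  - 1. PDF files: pdfplumber
  - 2. Word files: docx
  - 3. Excel files: xlrd
  - 4. Image files: PIL
  - 5. Extracting all images from a PDF: fitz
- II. AI applications
  - 1. OCR text recognition in images: easyocr
  - 2. OCR text recognition in images: paddleocr
  - 3. Text segmentation: jieba
  - 4. Automatic parsing of Chinese addresses: cpca
  - 5. Face recognition: face_recognition
- III. Performance optimization for data processing
  - 1. Faster data reading in pandas: feather
  - 2. Binary serialization of data: pickle
  - 3. Speeding up NumPy: numexpr
  - 4. Speeding up NumPy: cupy
  - 5. Optimizing Python numerical functions: numba
  - 6. pandas performance optimization: swifter
  - 7. pandas performance optimization: modin
- IV. Automated exploratory data analysis tools
  - 1. d-tale
  - 2. pandas profiling
  - 3. sweetviz
  - 4. autoviz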

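I. Reading files of various formats in Python

1. PDF files: pdfplumber

import pdfplumber
# Open the PDF with pdfplumber.open (older releases also offered a load method)
with pdfplumber.open("example.pdf", password='paswrd') as pdf:
    # Get the first page
    first_page = pdf.pages[0]
    # Print the first character on the first page
    print(first_page.chars[0])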
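
Each entry in first_page.chars is a dictionary describing one character. As a rough sketch of how such records could be grouped back into lines of text, the example below uses a few hand-written dictionaries in place of real pdfplumber output. The helper chars_to_lines and the sample values are made up for illustration, and only the 'text', 'x0', and 'top' keys are assumed.

```python
from collections import defaultdict

def chars_to_lines(chars, tolerance=2.0):
    """Group character dicts into text lines by their vertical position."""
    lines = defaultdict(list)
    for ch in chars:
        # Characters whose 'top' rounds to the same band are treated as one line
        key = round(ch["top"] / tolerance)
        lines[key].append(ch)
    result = []
    for key in sorted(lines):
        # Within a line, order characters from left to right
        ordered = sorted(lines[key], key=lambda c: c["x0"])
        result.append("".join(c["text"] for c in ordered))
    return result

# Made-up character records (not real pdfplumber output)
sample_chars = [
    {"text": "i", "x0": 16.0, "top": 10.0},
    {"text": "H", "x0": 10.0, "top": 10.2},
    {"text": "k", "x0": 16.0, "top": 30.1},
    {"text": "o", "x0": 10.0, "top": 30.0},
]
print(chars_to_lines(sample_chars))
```

For real documents, the page object's extract_text() method is usually the simpler route; the sketch only shows what the character records make possible.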

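2. Word files: docx

from docx import Document  # provided by the python-docx package
# Read the Word document
document = Document('sample.docx')
# Get all paragraphs
all_paragraphs = document.paragraphs
# Loop over the paragraphs
for paragraph in all_paragraphs:
    # Print the text of each paragraph
    print(paragraph.text)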

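3. Excel files: xlrd

import xlrd
# Open the workbook (xlrd 2.0 and later reads only the legacy .xls format;
# for .xlsx files use openpyxl or pandas.read_excel)
workbook = xlrd.open_workbook(filename='sample.xls')
# Get a sheet by index
table = workbook.sheets()[0]
# Get a sheet by name
table = workbook.sheet_by_name(sheet_name='Sheet2')
# Get the contents of a given row
table_list = table.row_values(rowx=0, start_colx=0, end_colx=None)
# Get the contents of a given column
table_list = table.col_values(colx=0, start_rowx=0, end_rowx=None)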

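The corresponding package for writing Excel files is xlwt:

import xlwt
book = xlwt.Workbook(encoding='utf-8')
sheet1 = book.add_sheet("custom sheet name")
style1 = xlwt.easyxf('font: bold on')  # optional cell style
# write(row, column, value, style); row and column indices start at 0
sheet1.write(0, 0, 'table_name', style1)
book.save('sample.xls')  # xlwt writes the legacy .xls format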

You can also read the file directly into a DataFrame with pandas:

import pandas as pd
df = pd.read_excel('filename.xlsx', sheet_name='Sheet2')

4. Image files: PIL

from PIL import Image
import numpy as np
import matplotlib.pyplot as plt

# Load the image
picture = Image.open('example.png')
# Convert the image to a NumPy array of pixel values and display it
picture_data = np.array(picture)
plt.imshow(picture_data.astype('uint8'))
plt.show()
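
For an RGB image, np.array(picture) gives an array of shape (height, width, channels). The sketch below uses a small synthetic array instead of a real file so that it runs on its own; the pixel values are made up, and the grayscale weights are the commonly used BT.601 luminance coefficients, chosen here as an assumption.

```python
import numpy as np

# Synthetic 2x3 RGB image standing in for np.array(Image.open(...))
picture_data = np.zeros((2, 3, 3), dtype=np.uint8)
picture_data[0, 0] = [255, 0, 0]      # top-left pixel is red
picture_data[1, 2] = [255, 255, 255]  # bottom-right pixel is white

height, width, channels = picture_data.shape
print(height, width, channels)

# A weighted sum over the channel axis gives a grayscale image
weights = np.array([0.299, 0.587, 0.114])
gray = np.rint(picture_data @ weights).astype(np.uint8)
print(gray.shape)
print(gray)
```

The same indexing works on the array loaded from a real image, for example picture_data[:, :, 0] for the red channel.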

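5. Extracting all images from a PDF: fitz

fitz is the module name under which PyMuPDF is imported, so install PyMuPDF first:

pip install pymupdf

The code below uses fitz to walk through the objects in the PDF, picks out the image objects with a regular expression, converts each one to a pixmap, and writes it out as a PNG file: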

import fitz
import re
import os

file_path = r'C:\xxx\xxx.pdf'  # path to the PDF file
dir_path = r'C:\xxx'  # folder where the images are saved

def pdf2image1(path, pic_path):
    # Matches object definitions whose /Subtype is /Image
    checkIM = r"/Subtype(?= */Image)"
    pdf = fitz.open(path)
    # Older PyMuPDF releases named these methods _getXrefLength, _getXrefString, and writePNG
    lenXREF = pdf.xref_length()
    count = 1
    for i in range(1, lenXREF):
        text = pdf.xref_object(i)
        isImage = re.search(checkIM, text)
        if not isImage:
            continue
        pix = fitz.Pixmap(pdf, i)
        if pix.size < 10000:  # size threshold check
            continue  # below the threshold: skip to the next iteration
        new_name = f"img_{count}.png"
        pix.save(os.path.join(pic_path, new_name))
        count += 1
        pix = None

pdf2image1(file_path, dir_path)
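Copyright notice: this is an original article by CSDN blogger 「小白^-」, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.
Original link: https://blog.csdn.net/weixin_46737755/article/details/113085763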
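
The regular expression in pdf2image1 can be tried on its own. The following sketch runs it against a few hand-written object definitions that stand in for real xref strings (the sample strings are assumptions, not output from an actual PDF). The lookahead (?= */Image) requires /Image to follow /Subtype, with or without spaces in between, without making it part of the match.

```python
import re

checkIM = r"/Subtype(?= */Image)"

# Hand-written object definitions standing in for real xref strings
image_obj = "<< /Type /XObject /Subtype /Image /Width 640 /Height 480 >>"
form_obj = "<< /Type /XObject /Subtype /Form /BBox [0 0 100 100] >>"
compact_obj = "<</Type/XObject/Subtype/Image/Width 8>>"

for obj in (image_obj, form_obj, compact_obj):
    match = re.search(checkIM, obj)
    # Only the "/Subtype" part is consumed; the lookahead just checks what follows
    print(bool(match), match.group(0) if match else None)
```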

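II. AI applications

1. OCR text recognition in images: easyocr

import easyocr
# Create a reader and specify the languages to recognize; 'ch_sim' and 'en' stand for Simplified Chinese and English
reader = easyocr.Reader(['ch_sim','en'])
# Read the text in the image
result = reader.readtext('example.png')
# Print the results one by one
for res in result:
    print(res)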

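Each printed result contains the corner coordinates of the text's bounding box, the recognized text, and the confidence score: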


([[151, 101], [195, 101], [195, 149], [151, 149]], '好', 0.7816301184856115)
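
Since each result is a (bounding box, text, confidence) tuple, it can be unpacked directly. Below is a minimal sketch that keeps only confident detections and turns the four corner points into a left/top/width/height box. The helper confident_text, the 0.5 threshold, and the second sample entry are made up for illustration.

```python
# Results in the tuple format shown above; the second entry is a made-up low-confidence detection
results = [
    ([[151, 101], [195, 101], [195, 149], [151, 149]], '好', 0.7816301184856115),
    ([[10, 10], [60, 10], [60, 30], [10, 30]], 'noise', 0.12),
]

def confident_text(results, min_confidence=0.5):
    """Keep detections at or above min_confidence and simplify their boxes."""
    kept = []
    for box, text, confidence in results:
        if confidence < min_confidence:
            continue
        xs = [point[0] for point in box]
        ys = [point[1] for point in box]
        kept.append({
            "text": text,
            "left": min(xs),
            "top": min(ys),
            "width": max(xs) - min(xs),
            "height": max(ys) - min(ys),
        })
    return kept

print(confident_text(results))
```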

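2. OCR text recognition in images: paddleocr

from paddleocr import PaddleOCR
# Initialize PaddleOCR with GPU support enabled
ocr = PaddleOCR(use_angle_cls=True, use_gpu=True)
text = ocr.ocr("example.png", cls=True)
# Print the recognized text of each line
# (the nesting of the result depends on the PaddleOCR version; in releases that
# return one list per image, iterate over text[0] instead)
for t in text:
    print(t[1][0])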

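3. Text segmentation: jieba

import jieba
# '某个句子' is a placeholder meaning "some sentence"
# Accurate mode (the default)
res = jieba.cut('某个句子')
# Full mode
res = jieba.lcut('某个句子', cut_all=True)
# Search engine mode
res = jieba.cut_for_search('某个句子')

for item in res:
    print(item, end=' ')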
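
Once a sentence has been segmented, a common next step is counting word frequencies. The sketch below is one way to do that with the standard library. The token list is hand-written to stand in for the output of jieba.lcut, and the stopword set is a made-up example, so neither reflects what jieba would actually return for a given sentence.

```python
from collections import Counter

# Hand-written tokens standing in for the output of jieba.lcut(...)
tokens = ["数据", "分析", "和", "数据", "建模", "的", "常用", "工具"]
# Made-up stopword set: function words to leave out of the count
stopwords = {"和", "的"}

counts = Counter(token for token in tokens if token not in stopwords)
print(counts.most_common(3))
```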

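4. Automatic parsing of Chinese addresses: cpca

cpca completes an address string and splits it into province, city, district, detailed address, and administrative division code:

import cpca

location_str = ["徐汇区虹漕路461号58号楼5楼",
                "泉州市洛江区万安塘西工业区",
                "北京朝阳区北苑华贸城"]

# Build a DataFrame with province, city, district, address, and administrative code columns:
data = cpca.transform(location_str)

# For district names shared by several cities, use the umap parameter to say which one is meant:
data = cpca.transform(['朝阳区汉庭酒店'], umap={'朝阳区':110105})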

5. Face recognition: face_recognition

Face comparison: check whether the faces in two images belong to the same person

import face_recognition
# Load the two images
known_image = face_recognition.load_image_file("lyf1.jpg")
unknown_image = face_recognition.load_image_file("lyf2.jpg")
# Compute the face encoding for each image
lyf_encoding = face_recognition.face_encodings(known_image)[0]
unknown_encoding = face_recognition.face_encodings(unknown_image)[0]
# Compare them
results = face_recognition.compare_faces([lyf_encoding], unknown_encoding)
print(results)
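
compare_faces decides a match by checking whether the distance between two encodings falls within a tolerance. Here is a minimal sketch of that idea in plain Python, using short made-up vectors in place of real face encodings (which are much longer). The helper name is_same_face is hypothetical, and the default of 0.6 mirrors the library's documented default tolerance.

```python
import math

def is_same_face(known_encoding, candidate_encoding, tolerance=0.6):
    """Return (match, distance) based on the Euclidean distance between encodings."""
    distance = math.dist(known_encoding, candidate_encoding)
    return distance <= tolerance, distance

# Made-up 4-dimensional vectors standing in for real face encodings
known = [0.10, 0.20, 0.30, 0.40]
close = [0.12, 0.18, 0.33, 0.41]
far = [0.90, -0.50, 0.10, 1.20]

print(is_same_face(known, close))
print(is_same_face(known, far))
```

A lower tolerance makes the comparison stricter, at the cost of more missed matches.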

Face detection: find the positions of all faces in an image and draw a rectangle around each one

import face_recognition
import cv2

image = face_recognition.load_image_file("lyf1.jpg")
# model can be set to 'cnn'; the default is 'hog', which is faster but less accurate
face_locations = face_recognition.face_locations(image, model='cnn')
# A list of tuples of found face locations in css (top, right, bottom, left) order

img = cv2.imread("lyf1.jpg")
cv2.imshow("lyf1.jpg", img)  # original image

# Go to get the data and draw the rectangle
for i, loc in enumerate(face_locations):
  top, right, bottom, left = loc
  start = (left, top)
  end = (right, bottom)

  color = (0, 255, 255)
  thickness = 2
  cv2.rectangle(img, start, end, color, thickness)

cv2.imshow("face_recon", img)
cv2.waitKey(0)  # keep the windows open until a key is pressed
cv2.destroyAllWindows()
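
The (top, right, bottom, left) order is easy to mix up with the (x, y) corner points that drawing functions expect. The two helpers below spell out the conversion; they are hypothetical utilities written for this note, not part of face_recognition or OpenCV, and the sample location is made up.

```python
def css_to_corners(location):
    """Convert (top, right, bottom, left) into top-left and bottom-right (x, y) points."""
    top, right, bottom, left = location
    return (left, top), (right, bottom)

def css_to_xywh(location):
    """Convert (top, right, bottom, left) into (x, y, width, height)."""
    top, right, bottom, left = location
    return left, top, right - left, bottom - top

# Made-up face location in css order
loc = (101, 195, 149, 151)
print(css_to_corners(loc))
print(css_to_xywh(loc))
```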

Facial landmark detection: locate key facial features such as the nose, eyes, and mouth

from PIL import Image, ImageDraw
import face_recognition
 
image = face_recognition.load_image_file("lyf1.jpg")
 
# Find all facial features in all the faces in the image
face_landmarks_list = face_recognition.face_landmarks(image)
 
# Create a PIL imagedraw object so we can draw on the picture
pil_image = Image.fromarray(image)
d = ImageDraw.Draw(pil_image)
 
for face_landmarks in face_landmarks_list:
 
  # Print the location of each facial feature in this image
  for facial_feature in face_landmarks.keys():
    print("The {} in this face has the following points: {}".format(facial_feature, face_landmarks[facial_feature]))
 
  # Let's trace out each facial feature in the image with a line!
  for facial_feature in face_landmarks.keys():
    d.line(face_landmarks[facial_feature], width=5)
 
# Show the picture
pil_image.show()

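III. Performance optimization for data processing

1. Faster data reading in pandas: feather

Replacing CSV files with binary feather files greatly speeds up file reads and writes.

Installation: pip install feather-format

import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4], 'col3': [5, 6]})
# Write the file
df.to_feather('example.feather')
# Read the file; you can choose to read only specific columns
df = pd.read_feather('example.feather', columns=['col1','col2'])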

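2. Binary serialization of data: pickle

Almost every Python data type (lists, dictionaries, sets, classes, and so on) can be serialized with pickle.

import pickle

a = {'a': 1, 'b': 2}
# Pickle data is binary, so open the file in 'wb' / 'rb' mode
with open('example.pkl', 'wb') as f:
    pickle.dump(a, f)
with open('example.pkl', 'rb') as f2:
    b = pickle.load(f2)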
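
To illustrate the "almost every data type" claim, the sketch below round-trips a nested structure through pickle.dumps and pickle.loads in memory, and also shows one of the exceptions: a lambda cannot be pickled. The record contents are made-up sample data.

```python
import pickle
from collections import Counter
from datetime import date

# Made-up record mixing several built-in and standard-library types
record = {
    "tags": {"ocr", "nlp"},          # set
    "shape": (3, 224, 224),          # tuple
    "released": date(2022, 5, 1),    # instance of a standard-library class
    "counts": Counter("banana"),     # dict subclass
}

data = pickle.dumps(record)          # serialize to bytes in memory
restored = pickle.loads(data)        # rebuild an equal but separate object
print(restored == record, restored is record)

# Not everything can be pickled: lambdas, for example, are rejected
try:
    pickle.dumps(lambda x: x + 1)
    lambda_ok = True
except (pickle.PicklingError, AttributeError):
    lambda_ok = False
print(lambda_ok)
```

Note that unpickling can execute arbitrary code, so pickle.load should only be used on data from a trusted source.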

3. Speeding up NumPy: numexpr

numexpr is simple to use: write the NumPy expression as a string and pass it to numexpr's evaluate function:

import numexpr as ne
import numpy as np

a = np.linspace(0,1000,10000)
# Evaluate the NumPy expression with numexpr
ne.evaluate('a**20')

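4. Speeding up NumPy: cupy

cupy is a library that implements NumPy-style arrays on NVIDIA GPUs using the CUDA libraries. A GPU has many CUDA cores, which allows much better parallel acceleration. Its usage is similar to NumPy's.

import cupy as cp

# Note: with the default float64 dtype this array takes about 8 GB of GPU memory
x_gpu = cp.ones((1000,1000,1000))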

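5. Optimizing Python numerical functions: numba

numba compiles Python functions into optimized machine code, with speeds that can approach those of C or Fortran. It is also very easy to use: just add a decorator in front of your own function:

import numba as nb
@nb.jit
def nb_sum(a):
    total = 0.0
    for i in range(len(a)):
        total += a[i]
    return total

import numpy as np
a = np.linspace(0,1000,1000)
nb_sum(a)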
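
One thing worth knowing when timing such a function is that the first call includes compilation. The sketch below times two consecutive calls to show this. It uses numba's njit decorator, and as an assumption made only so the example still runs where numba is not installed, it falls back to a do-nothing decorator; the array size is arbitrary.

```python
import time
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback for environments without numba: return the function unchanged
    def njit(func):
        return func

@njit
def nb_sum(a):
    total = 0.0
    for i in range(len(a)):
        total += a[i]
    return total

a = np.linspace(0, 1000, 200_000)

start = time.perf_counter()
first = nb_sum(a)   # with numba, this call also compiles the function
first_elapsed = time.perf_counter() - start

start = time.perf_counter()
second = nb_sum(a)  # later calls reuse the compiled code
second_elapsed = time.perf_counter() - start

print(f"first call:  {first_elapsed:.4f} s")
print(f"second call: {second_elapsed:.4f} s")
print(np.isclose(first, a.sum()))
```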

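numba also supports GPU acceleration and vectorization, which can improve performance further.

import numba as nb
from numba import cuda
cuda.select_device(0)  # index of the GPU to use; 0 is the first device

@cuda.jit
def cudasquare(x):
    i, j = cuda.grid(2)
    # Guard against threads that fall outside the array
    if i < x.shape[0] and j < x.shape[1]:
        x[i, j] *= x[i, j]

# Vectorization
from math import sin
@nb.vectorize()
def nb_vec_sin(a):
    return sin(a)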

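6. pandas performance optimization: swifter

swifter is a pandas plugin that works directly on pandas objects. It checks whether a computation can be parallelized or vectorized in order to improve performance.

import pandas as pd
import swifter

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df.swifter.apply(lambda x: x.mean())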
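
The speedup from vectorization comes from replacing per-row Python calls with a single operation on whole columns. The pandas-only sketch below contrasts the two forms on a small made-up DataFrame; it does not use swifter itself, and the column names are assumptions.

```python
import pandas as pd

# Made-up data for illustration
df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# Row-wise: the lambda is called once per row in Python
row_wise = df.apply(lambda row: row["price"] * row["qty"], axis=1)

# Vectorized: a single operation on whole columns
vectorized = df["price"] * df["qty"]

print(row_wise.tolist())
print(vectorized.tolist())
```

Both forms give the same values; on large DataFrames the vectorized form is typically much faster, which is why it is preferred when the computation allows it.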

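7. pandas performance optimization: modin

modin parallelizes reading and computation in pandas and gives a good speedup on large datasets. It is simple to use: change the import, and use everything else the same way as in pandas.

import modin.pandas as pd
# df, df2, and df3 are DataFrames created through modin.pandas, e.g. with pd.read_csv
df = pd.concat([df,df2,df3])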

IV. Automated exploratory data analysis tools

1. d-tale

import dtale
import pandas as pd

df = pd.read_csv('example.csv')
d = dtale.show(df)
d.open_browser()

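2. pandas profiling

import pandas as pd
import pandas_profiling  # newer releases are published as ydata-profiling (import ydata_profiling)

df = pd.read_csv('example.csv')
# The report renders inline in a notebook; call report.to_file('report.html') to save it from a script
report = pandas_profiling.ProfileReport(df)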

3. sweetviz

import sweetviz
import pandas as pd

df = pd.read_csv('example.csv')
# target_feat must be the name of a column in df
report = sweetviz.analyze([df,'Train'], target_feat = 'saleprice')
report.show_html('report.html')

4. autoviz

from autoviz.AutoViz_Class import AutoViz_Class
AV = AutoViz_Class()
df = AV.AutoViz('example.csv')

To be continued…

Reference articles:
https://mp.weixin.qq.com/s/0CbaLTHGsGbD3hVN5Rcn0A
https://mp.weixin.qq.com/s/-XPGXEl0tRmjKq48z7P-7g
https://blog.csdn.net/weixin_46737755/article/details/113085763
https://blog.csdn.net/juzicode00/article/details/122243330
https://www.jianshu.com/p/7953beff8ca3
https://www.jb51.net/article/182659.htm

Original URL: http://outofmemory.cn/langs/790144.html