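The code and dataset in this post come from HW1 of Professor Hung-yi Lee's course.
Dataset link
Reference code link
This post uses Professor Lee's first homework and the provided reference code as an introduction to PyTorch. As an introductory post, it does not cover network design.
Training a model on data breaks down into two main steps: reading the data and training the model. This introduction to PyTorch follows those two steps.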
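Reading the data
1. Using Dataset and DataLoader
This is the approach used in the reference code, and it is probably the recommended one. (The loading code below is simplified; some dataset-specific preprocessing has been removed.)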
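    import pandas as pd
    import torch
    from torch.utils.data import Dataset, DataLoader


    class COVID19Dataset(Dataset):
        ''' Dataset for loading and preprocessing the COVID19 dataset '''
        def __init__(self, path):
            # Read the data at the given path (using pandas)
            df = pd.read_csv(path)
            # Convert the data to the format PyTorch needs
            data = torch.tensor(df.values, dtype=torch.float)
            # The first column is the ID, which carries no information; drop it
            data = data[:, 1:]
            # All feature columns are used here; you could also keep only the useful ones
            feats = list(range(93))
            self.target = data[:, -1]
            self.data = data[:, feats]

        def __getitem__(self, index):
            # Required method: returns one (features, target) pair during training
            return self.data[index], self.target[index]

        def __len__(self):
            # Returns the number of samples; DataLoader relies on this
            return len(self.data)


    # DataLoader then takes care of shuffling and batching
    ds = COVID19Dataset('./ml2021spring-hw1/covid.train.csv')
    batch_size = 100
    train_ds = DataLoader(ds, batch_size=batch_size, shuffle=True)
    # Note: for simplicity both loaders wrap the same dataset;
    # a real validation loader should wrap held-out data
    dev_ds = DataLoader(ds, batch_size=batch_size, shuffle=True)

2. A more direct approach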
Reference article
You can also skip Dataset and DataLoader and implement the shuffling and batch-wise reading yourself.
    import pandas as pd
    import torch

    path = './ml2021spring-hw1/covid.train.csv'
    df = pd.read_csv(path)
    dataset_tensor = torch.tensor(df.values, dtype=torch.float)
    # Drop the ID column (column 0); what remains is 93 feature columns plus the target
    dataset_tensor = dataset_tensor[:, 1:]
    feats = list(range(93))

    # Shuffle, then split into training (60%), validation (20%) and test (20%) sets
    n = dataset_tensor.shape[0]
    random_indices = torch.randperm(n)
    training_indices = random_indices[:int(n * 0.6)]
    validating_indices = random_indices[int(n * 0.6):int(n * 0.8)]
    testing_indices = random_indices[int(n * 0.8):]

    training_set_x = dataset_tensor[training_indices][:, feats]
    training_set_y = dataset_tensor[training_indices][:, -1:]
    validating_set_x = dataset_tensor[validating_indices][:, feats]
    validating_set_y = dataset_tensor[validating_indices][:, -1:]
    testing_set_x = dataset_tensor[testing_indices][:, feats]
    testing_set_y = dataset_tensor[testing_indices][:, -1:]

Training the model
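Training a model is more involved. I will not attempt to explain the underlying theory here and will only summarize the steps.
- The same dataset is trained on many times (epochs), and each epoch is split into multiple batches.
- Pay attention to the model's mode. Training mode (model.train()) is the one to use while fitting the model: gradients are computed and parameters are updated, and layers such as Dropout and BatchNorm use their training behavior. For now I treat everything else as non-training mode (model.eval()), where the model is only evaluated: no parameters are updated, and gradient computation can be switched off with torch.no_grad().
- Choose a suitable loss function and a matching optimizer. For each batch, the optimizer updates the parameters based on the gradients of the loss.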
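The first step above (epochs made up of batches) can be sketched by hand, without a DataLoader. This is only an illustration on synthetic tensors, not the reference code's method; `x`, `y`, `n_epochs` and `batch_size` are placeholder names and values chosen for the example.

```python
import torch

# Synthetic stand-in data (assumption): 10 samples with 3 features each
torch.manual_seed(0)
x = torch.randn(10, 3)
y = torch.randn(10)

n_epochs = 2     # number of full passes over the dataset
batch_size = 4   # number of samples per batch

batch_sizes_seen = []
for epoch in range(n_epochs):
    # A fresh permutation each epoch reshuffles the sample order
    perm = torch.randperm(x.shape[0])
    for start in range(0, x.shape[0], batch_size):
        idx = perm[start:start + batch_size]
        batch_x, batch_y = x[idx], y[idx]
        # A real loop would run the forward/backward pass on (batch_x, batch_y) here
        batch_sizes_seen.append(batch_x.shape[0])

print(batch_sizes_seen)  # prints [4, 4, 2, 4, 4, 2]
```

With 10 samples and a batch size of 4, each epoch has three batches and the last one is smaller. DataLoader produces the same pattern unless drop_last=True is set.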
    for poch in range(e_poch):
        model.train()  # training mode
        for x, y in train_ds:
            optimizer.zero_grad()  # reset the gradients to zero
            pred = model(x)
            mse_loss = model.cal_loss(pred, y.squeeze(-1))
            mse_loss.backward()  # compute gradients
            optimizer.step()  # update parameters

        model.eval()  # evaluation mode
        total_loss = 0
        with torch.no_grad():  # no gradients are needed for validation
            for x, y in dev_ds:
                pred = model(x)
                mse_loss = model.cal_loss(pred, y.squeeze(-1))
                total_loss += mse_loss.item()
        # Average over the number of validation batches
        total_loss = total_loss / len(dev_ds)

        if total_loss < mini_loss:
            mini_loss = total_loss
            print("poch %d find better model,MSE loss is %.4f" % (poch, mini_loss))
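The loop above switches between model.train() and model.eval() and wraps validation in torch.no_grad(). The sketch below illustrates what those calls change, using a standalone Dropout layer as an example; the layer, the input and the variable names are chosen for illustration and are not part of the homework code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Dropout(p=0.5)
x = torch.ones(1000)

# Training mode: elements are zeroed at random and the survivors
# are scaled by 1 / (1 - p), which is 2 here
layer.train()
train_out = layer(x)

# Evaluation mode: Dropout passes the input through unchanged
layer.eval()
eval_out = layer(x)

# torch.no_grad() is separate from the mode: it stops autograd from
# tracking operations, so nothing computed inside can be backpropagated
w = torch.ones(3, requires_grad=True)
with torch.no_grad():
    y = (w * 2).sum()

print(torch.equal(eval_out, x))  # prints True
print(y.requires_grad)           # prints False
```

The mode controls how layers such as Dropout and BatchNorm behave, while torch.no_grad() controls gradient tracking, which is why the validation pass uses both.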
My simplified full code
    # PyTorch
    from torch.utils.data import Dataset, DataLoader
    import pandas as pd
    import torch
    import torch.nn as nn


    # The model is copied from the reference code; building models is not covered in this post
    class NeuralNet(nn.Module):
        ''' A simple fully-connected deep neural network '''
        def __init__(self, input_dim):
            super(NeuralNet, self).__init__()
            # Define your neural network here
            # TODO: How to modify this model to achieve better performance?
            self.net = nn.Sequential(
                nn.Linear(input_dim, 32),
                nn.BatchNorm1d(32),  # BN to speed up training
                nn.Dropout(p=0.2),   # Dropout to reduce overfitting; do not place it before BN
                nn.LeakyReLU(),      # a different activation function
                nn.Linear(32, 1)
            )
            # Mean squared error loss
            self.criterion = nn.MSELoss(reduction='mean')
            # self.criterion = nn.SmoothL1Loss(size_average=True)

        def forward(self, x):
            ''' Given input of size (batch_size x input_dim), compute output of the network '''
            return self.net(x).squeeze(1)

        def cal_loss(self, pred, target):
            ''' Calculate loss '''
            regularization_loss = 0
            for param in self.parameters():
                # TODO: you may implement L1/L2 regularization here
                # Use an L2 regularization term
                # regularization_loss += torch.sum(abs(param))
                regularization_loss += torch.sum(param ** 2)
            return self.criterion(pred, target) + 0.00075 * regularization_loss


    class COVID19Dataset(Dataset):
        def __init__(self, path):
            df = pd.read_csv(path)
            self.data = torch.tensor(df.values, dtype=torch.float)
            self.target = self.data[:, -1:]  # last column is the target
            self.data = self.data[:, 1:-1]   # drop the ID column and the target column

        def __getitem__(self, index):
            return self.data[index], self.target[index]

        def __len__(self):
            return len(self.data)


    train_path = './ml2021spring-hw1/covid.train.csv'
    ds = COVID19Dataset(train_path)
    batch_size = 100
    train_ds = DataLoader(ds, batch_size=batch_size, shuffle=True)
    dev_ds = DataLoader(ds, batch_size=batch_size, shuffle=True)

    e_poch = 10000
    mini_loss = 1000
    early_stop = 500
    stop = 0  # epochs since the last improvement
    model = NeuralNet(93)
    optimizer = torch.optim.Adam(model.parameters())

    for poch in range(e_poch):
        model.train()
        for x, y in train_ds:
            optimizer.zero_grad()
            pred = model(x)
            mse_loss = model.cal_loss(pred, y.squeeze(-1))
            mse_loss.backward()
            optimizer.step()

        model.eval()
        total_loss = 0
        with torch.no_grad():
            for x, y in dev_ds:
                pred = model(x)
                mse_loss = model.cal_loss(pred, y.squeeze(-1))
                total_loss += mse_loss.item()
        total_loss = total_loss / len(dev_ds)

        if total_loss < mini_loss:
            stop = 0
            mini_loss = total_loss
        else:
            stop += 1
        # Stop early once the loss has not improved for more than early_stop epochs
        if stop > early_stop:
            break
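The listing above builds both loaders from the same dataset, so the validation loss is measured on data the model has already trained on. One possible way to hold out a validation set is torch.utils.data.random_split. The sketch below assumes an 80/20 split and uses random tensors in place of the CSV file; `full_ds`, `train_loader`, `dev_loader` and the batch size of 32 are hypothetical choices for the example, not taken from the reference code.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader, random_split

# Synthetic stand-in for the CSV data (assumption): 100 samples, 93 features
torch.manual_seed(0)
features = torch.randn(100, 93)
targets = torch.randn(100, 1)
full_ds = TensorDataset(features, targets)

# Hold out 20% of the samples for validation
n_dev = int(len(full_ds) * 0.2)
n_train = len(full_ds) - n_dev
train_set, dev_set = random_split(full_ds, [n_train, n_dev])

# Shuffle only the training data; validation order does not matter
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
dev_loader = DataLoader(dev_set, batch_size=32, shuffle=False)

print(len(train_set), len(dev_set))  # prints 80 20
```

The same call works on any Dataset with a __len__ method, so a COVID19Dataset instance could be passed in place of `full_ds`.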