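This is a homework assignment. I've solved the problem, but I'd like to find a faster solution.
The problem is as follows: I need to figure out how many possible amino acid (aa) sequences exist with a total mass m.
I have a table of amino acids (single-letter strings) and their corresponding masses (int), which I put in a dictionary.
My initial solution was to create all possible combinations of aa and compare each combination's total mass to the mass m. This works for small values of m, but the number of combinations gets very large once m reaches the hundreds.
I made a few minor optimizations and got it working reasonably fast for m < 500, which is good enough for this problem, but I want to know how to make it work for higher masses. This is what I have so far: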

    totalmass = m

    def pepList():
        tempList = ['']
        temp2List = []
        length = 0
        total = 0
        aminoList = 'GASPVTCINDKEMHFRYW'  # these are all the amino acids
        while length < maxLength:
            for i in tempList:
                for j in aminoList:
                    pepMass = peptideMass(i+j, masstable)  # find the mass of this peptide
                    if pepMass == totalmass:
                        total += 1
                    elif pepMass <= totalmass:
                        temp2List.append(i+j)
            tempList = []
            for i in temp2List:
                tempList.append(i)
            temp2List = []
            length = length + 1
        print (total)

    pepList()

I can get the solution for m = 300 in about one second, but m = 500 takes about 40 seconds.
I tried an alternative using itertools, but it wasn't any faster:

    total = 0
    pepList = []
    for i in range(maxLength+1):
        for p in itertools.combinations_with_replacement(aminoList, i):
            # order matters for the total number of peptides but not for calculating
            # the total mass
            amino = ''.join(p)
            if peptideMass(amino, masstable) == mass:
                pepList.append(amino)
    print (len(pepList))

    newpepList = []
    for i in pepList:
        for p in itertools.permutations(i, r = len(i)):
            # I use permutations here to get the total number because order matters
            if p not in newpepList:
                newpepList.append(p)
                total += 1
    print (total)

Sample input:
m = 270
Output:
22
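So this can be reduced to a linear programming problem: find the values of the coefficients such that M = a*A + b*C + c*D + d*E + e*G + ... + r*W, where each uppercase letter stands for an amino acid's mass and each lowercase coefficient is the number of times that amino acid occurs.
Once you have the solutions, you can generate all possible permutations of a given set of amino acids, or, if you only need the number of permutations, you can calculate it directly.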
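To make "calculate it directly" concrete: the number of distinct orderings of a given set of amino acids is a multinomial coefficient, n! divided by the factorial of each amino acid's count. The sketch below is only an illustration; the function name `count_orderings` is made up here, and the six compositions listed assume the usual integer residue masses (G=57, A=71, V=99, N=114, K=128, R=156), since the question's mass table isn't shown.

```python
from math import factorial

def count_orderings(composition):
    """Number of distinct sequences for a multiset of residues,
    e.g. {'G': 3, 'V': 1} means three G and one V."""
    total = factorial(sum(composition.values()))
    for count in composition.values():
        total //= factorial(count)  # exact integer division at every step
    return total

# Compositions whose integer mass is 270, under the masses assumed above.
compositions = [
    {'N': 1, 'R': 1},          # 114 + 156
    {'G': 2, 'R': 1},          # 2*57 + 156
    {'G': 1, 'V': 1, 'N': 1},  # 57 + 99 + 114
    {'A': 2, 'K': 1},          # 2*71 + 128
    {'G': 3, 'V': 1},          # 3*57 + 99
    {'G': 1, 'A': 3},          # 57 + 3*71
]

print(sum(count_orderings(c) for c in compositions))  # 22, matching the sample output
```

The per-composition counts are 2, 3, 6, 3, 4 and 4, which add up to the 22 in the sample.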
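Edit:
As @Hooked pointed out, this is not linear programming, for two reasons: first, we need integer coefficients; second, we are looking for all combinations rather than a single optimal solution.
I've worked out a recursive generator, as follows: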

    from math import floor, ceil
    import profile

    # Python 2 code (raw_input, xrange). On Python 3, use input(), range(),
    # and // instead of / in permutations().

    amino_weight = {
        'A':  71.038,
        'C': 103.009,
        'D': 115.027,
        'E': 129.043,
        'F': 147.068,
        'G':  57.021,
        'H': 137.059,
        'I': 113.084,
        'K': 128.095,
        'L': 113.084,   # you omitted leucine?
        'M': 131.040,
        'N': 114.043,
        'P':  97.053,
        'Q': 128.059,   # you omitted glutamine?
        'R': 156.101,
        'S':  87.032,
        'T': 101.048,
        'V':  99.068,
        'W': 186.079,
        'Y': 163.063
    }

    def get_float(prompt):
        # keep asking until the user enters a valid number
        while True:
            try:
                return float(raw_input(prompt))
            except ValueError:
                pass

    # This is where the fun happens!
    def get_mass_combos(aminos, pos, lo, hi, cutoff):
        this = aminos[pos]   # use a pointer into the string, to avoid copying 8 million partial strings around
        wt = amino_weight[this]
        kmax = int(floor(hi / wt))
        npos = pos - 1
        if npos:             # more aminos to consider recursively
            for k in xrange(0, kmax + 1):
                mass    = k * wt
                nlo     = lo - mass
                nhi     = hi - mass
                ncutoff = cutoff - mass
                if nlo <= 0. and nhi >= 0.:
                    # we found a winner!
                    yield {this: k}
                elif ncutoff < 0.:
                    # no further solution is possible
                    break
                else:
                    # recurse
                    for cc in get_mass_combos(aminos, npos, nlo, nhi, ncutoff):
                        if k > 0:
                            cc[this] = k
                        yield cc
        else:                # last amino - it's this or nothing
            kmin = int(ceil(lo / wt))
            for k in xrange(kmin, kmax+1):
                yield {this: k}

    def to_string(combo):
        keys = sorted(combo)
        return ''.join(k*combo[k] for k in keys)

    def total_mass(combo):
        return sum(amino_weight[a]*n for a,n in combo.items())

    def fact(n):
        num = 1
        for i in xrange(2, n+1):
            num *= i
        return num

    def permutations(combo):
        # multinomial coefficient: (sum of counts)! / product of (count!)
        num = 0
        div = 1
        for v in combo.values():
            num += v
            div *= fact(v)
        return fact(num) / div

    def find_combos(lo, hi):
        total = 0
        bases = []
        # sort the amino acids by ascending weight
        aminos = ''.join(sorted(amino_weight, key = lambda x: amino_weight[x]))
        # arguments match the signature (aminos, pos, lo, hi, cutoff)
        for combo in get_mass_combos(aminos, len(aminos)-1, lo, hi, hi - amino_weight[aminos[0]]):
            base = to_string(combo)
            bases.append(base)
            mass = total_mass(combo)
            cc = permutations(combo)
            total += cc
            print("{} (mass {}, {} permutations)".format(base, mass, cc))
        print('Total: {} bases, {} permutations'.format(len(bases), total))

    def main():
        lo = get_float('Bottom of target mass range? ')
        hi = get_float('Top of target mass range? ')

        prof = profile.Profile()
        prof.run('find_combos({},{})'.format(lo, hi))
        prof.print_stats()

    if __name__=="__main__":
        main()

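It also uses floating-point amino acid masses and searches for a mass range. On my machine (i5-870), searching for masses between 748.0 and 752.0 returns 7,505 bases, with a total of 9,400,528 permutations, in 3.82 seconds.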
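If only the count is needed for the integer masses in the question, a different technique may scale further than any enumeration: dynamic programming over the mass. The number of sequences with mass m is the sum, over every amino acid of weight w, of the number of sequences with mass m - w. What follows is a minimal sketch rather than the generator above; the name `count_peptides` is hypothetical, and the mass list assumes the usual integer residue masses for the 18 letters in the question's `aminoList` (the question's `masstable` isn't shown).

```python
# Assumed integer masses for G A S P V T C I N D K E M H F R Y W
# (L and Q share integer masses with I and K, so they are left out,
# as in the question's aminoList).
INTEGER_MASSES = [57, 71, 87, 97, 99, 101, 103, 113, 114,
                  115, 128, 129, 131, 137, 147, 156, 163, 186]

def count_peptides(total_mass, masses=INTEGER_MASSES):
    """Count ordered sequences of residues whose masses sum to total_mass."""
    counts = [0] * (total_mass + 1)
    counts[0] = 1  # the empty sequence has mass 0
    for m in range(1, total_mass + 1):
        # a sequence of mass m ends in some residue of weight w <= m
        counts[m] = sum(counts[m - w] for w in masses if w <= m)
    return counts[total_mass]

print(count_peptides(270))  # 22 with the masses assumed above
```

The work grows as m times the number of amino acids instead of with the number of sequences, so masses in the thousands should stay cheap. The trade-off is that it relies on integer masses and yields only the count, not the sequences themselves.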