Extracting dates in different formats with regular expressions and sorting them - Pandas


I believe this is one of the Coursera text mining assignments. You can solve it with regular expressions and str.extract. Given dates.txt:

import pandas as pd

# Read dates.txt into a Series, one line of text per element
doc = []
with open('dates.txt') as file:
    for line in file:
        doc.append(line)

df = pd.Series(doc)

def date_sorter():
    # expand=False makes str.extract return a Series instead of a one-column DataFrame
    # Get the dates written with a month name
    one = df.str.extract(r'((?:\d{,2}\s)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*(?:-|\.|\s|,)\s?\d{,2}[a-z]*(?:-|,|\s)?\s?\d{2,4})', expand=False)
    # Get the dates written as numbers
    two = df.str.extract(r'((?:\d{1,2})(?:(?:/|-)\d{1,2})(?:(?:/|-)\d{2,4}))', expand=False)
    # Get the dates that have no day, i.e. only month and year, or only a year
    three = df.str.extract(r'((?:\d{1,2}(?:-|/))?\d{4})', expand=False)
    # Fill the NaNs in one from two and then three, fix the month names that are
    # misspelled in the text file, and convert the result to datetime.
    # Note: pandas 2.0 and later may need format='mixed' here, since the strings differ in format.
    dates = pd.to_datetime(one.fillna(two).fillna(three).replace('Decemeber', 'December', regex=True).replace('Janaury', 'January', regex=True))
    return pd.Series(dates.sort_values())

date_sorter()

Output:

9      1971-04-10
84     1971-05-18
2      1971-07-08
53     1971-07-11
28     1971-09-12
474    1972-01-01
153    1972-01-13
13     1972-01-26
129    1972-05-06
98     1972-05-13
111    1972-06-10
225    1972-06-15
[row garbled in the source: "July 31, 1972"]
171    1972-10-04
191    1972-11-30
486    1973-01-01
335    1973-02-01
415    1973-02-01
36     1973-02-14
405    1973-03-01
323    1973-03-01
422    1973-04-01
375    1973-06-01
380    1973-07-01
345    1973-10-01
57     1973-12-01
481    1974-01-01
436    1974-02-01
104    1974-02-24
299    1974-03-01
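To see how the three patterns in date_sorter fall back on one another, here is a minimal sketch that runs them on four made-up lines. The sample sentences and the names WORDS, NUMERIC and PARTIAL are assumptions for illustration only; they are not taken from dates.txt or from the original answer.

```python
import pandas as pd

# Hypothetical sample lines (not from dates.txt), one per date style
sample = pd.Series([
    "Seen on 20 Mar 2009 for follow-up",
    "Admitted 04/10/1971 with chest pain",
    "Symptoms began 6/2008",
    "History since 1984",
])

# The three patterns used in date_sorter, given illustrative names
WORDS = r'((?:\d{,2}\s)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*(?:-|\.|\s|,)\s?\d{,2}[a-z]*(?:-|,|\s)?\s?\d{2,4})'
NUMERIC = r'((?:\d{1,2})(?:(?:/|-)\d{1,2})(?:(?:/|-)\d{2,4}))'
PARTIAL = r'((?:\d{1,2}(?:-|/))?\d{4})'

# expand=False returns a Series; rows without a match are NaN
one = sample.str.extract(WORDS, expand=False)
two = sample.str.extract(NUMERIC, expand=False)
three = sample.str.extract(PARTIAL, expand=False)

# Most specific pattern first: each fillna only fills rows still missing
combined = one.fillna(two).fillna(three)
print(combined.tolist())
```

The order of the fillna calls matters. On the second line the loosest pattern alone would pick up "10/1971", which drops the month, so it is only consulted after the stricter numeric pattern has had its turn.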

If you only want to return the index, use:

return pd.Series(dates.sort_values().index)
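As a small illustration of the index-only variant, the sketch below sorts three dates borrowed from the output above and keeps only their index labels. The Series is built by hand here instead of by date_sorter, so treat it as an assumption-laden toy rather than the assignment data.

```python
import pandas as pd

# Three (index, date) pairs copied from the sample output, in shuffled order
dates = pd.Series(
    pd.to_datetime(["1972-01-26", "1971-04-10", "1971-07-08"]),
    index=[13, 9, 2],
)

# sort_values orders by date and carries the original index labels along
sorted_dates = dates.sort_values()

# Wrapping the index in a Series gives positions 0..n-1 mapped to the original labels
order = pd.Series(sorted_dates.index)
print(order.tolist())
```

The resulting Series lists the original line numbers in chronological order, which is the form that last return statement produces.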

Breaking down the first regular expression

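(?: ... )   # ?: marks a non-capturing group.
((?:\d{,2}\s)?   # Optional day group: zero to two digits followed by one whitespace character. The ? applies to the preceding token or group, so the whole group occurs once or not at all.
(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*   # One of the month abbreviations followed by any number (*) of lowercase letters [a-z], so both Jan and December match.
(?:-|\.|\s|,)   # One separator: hyphen, dot, whitespace or comma.
\s?   # Optional whitespace (the ? here applies only to the preceding \s).
\d{,2}[a-z]*   # At most two digits followed by any number (*) of letters, e.g. 1st, 13th, 22nd.
(?:-|,|\s)?   # A hyphen, comma or whitespace may occur once, or not at all because of the trailing ?.
\s?   # Optional whitespace, at most one character (the ? here applies only to the \s).
\d{2,4})   # The year: two to four digits.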
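The breakdown above can be checked with the standard re module. The sketch below rewrites the same pattern with re.VERBOSE so that each piece carries its comment, then tries it on a few made-up strings. The name WORD_DATE, the helper find_word_date and the test strings are hypothetical and only serve the illustration.

```python
import re

# First pattern from the answer, split across lines with re.VERBOSE.
# In verbose mode, whitespace in the pattern is ignored and # starts a comment.
WORD_DATE = re.compile(r"""
    (
      (?:\d{,2}\s)?      # optional day: 0 to 2 digits plus one whitespace
      (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*  # month name
      (?:-|\.|\s|,)      # one separator: hyphen, dot, whitespace or comma
      \s?                # optional whitespace
      \d{,2}[a-z]*       # optional day plus an ordinal suffix such as 'st'
      (?:-|,|\s)?        # optional separator
      \s?                # optional whitespace
      \d{2,4}            # year: 2 to 4 digits
    )
""", re.VERBOSE)


def find_word_date(text):
    """Return the first month-name date found in text, or None."""
    match = WORD_DATE.search(text)
    return match.group(1) if match else None


for text in ["Seen on 20 Mar 2009 for follow-up", "March 20, 2009",
             "Mar-20-2009", "May 3rd, 1985", "Feb 2009", "Decemeber 1998"]:
    print(repr(text), "->", repr(find_word_date(text)))
```

A misspelled month such as "Decemeber" still matches, because [a-z]* accepts any run of letters after the three-letter abbreviation. That is why the spelling is corrected only afterwards, just before the conversion to datetime.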

Hope this helps.


Original article: http://outofmemory.cn/zaji/5644591.html
