C++: Checking for duplicates in a large vector of strings

I am trying to find duplicate string instances in a vector of roughly 2.5 million strings.

At the moment I am using something like the following:

std::vector<string> concatVec; // Holds all of the concatenated strings containing columns C,D,E,J and U.
std::vector<string> dupecheckVec; // Holds all of the unique instances of concatenated columns
std::vector<unsigned int> linenoVec; // Holds the line numbers of the unique instances only

// copy first element across, it cannot be a duplicate yet
dupecheckVec.push_back(concatVec[0]);
linenoVec.push_back(0);

// copy across and do the dupecheck
for (unsigned int i = 1; i < concatVec.size(); i++)
{
    bool exists = false;
    for (unsigned int x = 0; x < dupecheckVec.size(); x++)
    {
        if (concatVec[i] == dupecheckVec[x])
        {
            exists = true;
        }
    }
    if (exists == false)
    {
        dupecheckVec.push_back(concatVec[i]);
        linenoVec.push_back(i);
    }
    else
    {
        exists = false;
    }
}

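This works fine for small files, but it obviously takes an extremely long time as the file size grows, because of the nested for loop and the increasing number of strings held in dupecheckVec.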

What would be a less horrific way to do this for a large file?

Solution

If you don't mind reordering the vector, then this should do it in O(n * log(n)) time:

std::sort(vector.begin(), vector.end());
vector.erase(std::unique(vector.begin(), vector.end()), vector.end());
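As a self-contained illustration of the sort-then-unique idiom, here is a minimal sketch. The helper name `sortedUnique` is hypothetical and not part of the original answer. It takes the vector by value so the caller's data stays untouched, which is an assumption about what is convenient rather than a requirement.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (not from the original post): returns the distinct
// strings of `values` in sorted order. The argument is taken by value, so
// the caller's vector is not modified.
std::vector<std::string> sortedUnique(std::vector<std::string> values)
{
    // Sorting places equal strings next to each other: O(n log n) comparisons.
    std::sort(values.begin(), values.end());

    // std::unique shifts the first element of each run of equal neighbours
    // to the front and returns the new logical end; erase trims the rest.
    values.erase(std::unique(values.begin(), values.end()), values.end());

    // Sanity check: no two neighbouring elements are equal any more.
    assert(std::adjacent_find(values.begin(), values.end()) == values.end());
    return values;
}
```

The original order and the line numbers are lost with this approach, so it only fits when the set of distinct strings is all that is needed.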

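To preserve the order, you could instead use a vector of (line number, string*) pairs: sort by string, deduplicate with a comparator that compares the string contents, and finally sort by line number, along the following lines: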

// Here "vector" is a std::vector<pair> holding one entry per input line.
struct pair {
    int line;
    std::string const * string;
};

struct OrderByline {
    bool operator()(pair const & x, pair const & y) const {
        return x.line < y.line;
    }
};

struct OrderByString {
    bool operator()(pair const & x, pair const & y) const {
        return *x.string < *y.string;
    }
};

struct StringEquals {
    bool operator()(pair const & x, pair const & y) const {
        return *x.string == *y.string;
    }
};

// Group equal strings together, drop the repeats, then restore line order.
std::sort(vector.begin(), vector.end(), OrderByString());
vector.erase(std::unique(vector.begin(), vector.end(), StringEquals()), vector.end());
std::sort(vector.begin(), vector.end(), OrderByline());
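The order-preserving variant can be sketched end to end as follows. The names `LineRef` and `firstOccurrences` are hypothetical, and the sketch uses lambdas (C++11) instead of functor structs. One assumption goes beyond the answer: because std::sort is not stable, the sort key here breaks ties on the line number, so that the surviving entry for each string is its earliest line, matching what linenoVec records in the question.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record: a line number plus a pointer to that line's string.
struct LineRef {
    std::size_t line;
    const std::string* text;
};

// Hypothetical helper: returns, in ascending order, the index of the first
// occurrence of every distinct string in `lines`.
std::vector<std::size_t> firstOccurrences(const std::vector<std::string>& lines)
{
    std::vector<LineRef> refs;
    refs.reserve(lines.size());
    for (std::size_t i = 0; i < lines.size(); ++i) {
        refs.push_back(LineRef{i, &lines[i]});  // pointers avoid copying strings
    }

    // Sort by string content; ties are broken by line number so the earliest
    // occurrence leads each run of equal strings.
    std::sort(refs.begin(), refs.end(), [](const LineRef& x, const LineRef& y) {
        const int c = x.text->compare(*y.text);
        return c < 0 || (c == 0 && x.line < y.line);
    });

    // std::unique keeps the first element of each run of equal strings.
    refs.erase(std::unique(refs.begin(), refs.end(),
                           [](const LineRef& x, const LineRef& y) {
                               return *x.text == *y.text;
                           }),
               refs.end());

    // Restore the original file order.
    std::sort(refs.begin(), refs.end(), [](const LineRef& x, const LineRef& y) {
        return x.line < y.line;
    });

    std::vector<std::size_t> result;
    result.reserve(refs.size());
    for (const LineRef& r : refs) {
        result.push_back(r.line);
    }
    assert(result.size() <= lines.size());  // deduplication never adds entries
    return result;
}
```

The returned indices can be used to look up the unique strings in the input vector, so no second container of strings is needed.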


Original address: http://outofmemory.cn/langs/1236437.html
