Extracting Java error stacks from log files

I have a Java application that, when something goes wrong, writes an error stack like the one below for each error.

    <Errors>
        <Error ErrorCode="Code" ErrorDescription="Description" ErrorInfo="" ErrorID="ID">
            <Attribute name="ErrorCode" Value="Code"/>
            <Attribute name="ErrorDescription" Value="Description"/>
            <Attribute name="Key" Value="Key"/>
            <Attribute name="Number" Value="Number"/>
            <Attribute name="ErrorID" Value="ID"/>
            <Attribute name="UserID" Value="User"/>
            <Attribute name="ProgID" Value="Prog"/>
            <Stack>typical Java stack</Stack>
        </Error>
        <Error>
            Similar info to the above
        </Error>
    </Errors>

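I wrote a Java log parser to go through the log files and gather information about these errors. It works, but it is slow and inefficient, especially for log files of hundreds of megabytes. I am basically just using String operations to detect where the start/end tags are and tallying them up.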

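Is there a way (via Unix grep, Python, or Java) to efficiently extract the errors and get a count of how many times each one occurs? The log file as a whole is not XML, so I cannot use an XML parser or XPath. Another problem I face is that the end of an error sometimes rolls over into another file, so the current file might not contain the whole stack as shown above.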

Edit 1:

Here is what I have currently (only the relevant portions, to save space).

    // Parse files
    for (File f : allfiles) {
        System.out.println("Parsing: " + f.getAbsolutePath());
        BufferedReader br = new BufferedReader(new FileReader(f));
        String line = "";
        String fullErrorStack = "";
        while ((line = br.readLine()) != null) {
            if (line.contains("<Errors>")) {
                fullErrorStack = line;
                while (!line.contains("</Errors>")) {
                    line = br.readLine();
                    try {
                        fullErrorStack = fullErrorStack + line.trim() + " ";
                    } catch (NullPointerException e) {
                        // End of file but end of error stack is in another file.
                        fullErrorStack = fullErrorStack + "</Stack></Error></Errors> ";
                        break;
                    }
                }
                // Each value runs from just after name=" up to the next quote-plus-space.
                String errorCode = fullErrorStack.substring(
                        fullErrorStack.indexOf("ErrorCode=\"") + "ErrorCode=\"".length(),
                        fullErrorStack.indexOf("\" ", fullErrorStack.indexOf("ErrorCode=\"")));
                String errorDescription = fullErrorStack.substring(
                        fullErrorStack.indexOf("ErrorDescription=\"") + "ErrorDescription=\"".length(),
                        fullErrorStack.indexOf("\" ", fullErrorStack.indexOf("ErrorDescription=\"")));
                String errorStack = fullErrorStack.substring(
                        fullErrorStack.indexOf("<Stack>") + "<Stack>".length(),
                        fullErrorStack.indexOf("</Stack>", fullErrorStack.indexOf("<Stack>")));
                APIErrors.add(f.getAbsolutePath() + splitter + errorCode + ": "
                        + errorDescription + splitter + errorStack.trim());
                fullErrorStack = "";
            }
        }
    }
    // Count how often each distinct error string occurs.
    Set<String> uniqueAPIErrors = new HashSet<String>(APIErrors);
    for (String uniqueAPIError : uniqueAPIErrors) {
        APIErrorsUnique.add(uniqueAPIError + splitter
                + Collections.frequency(APIErrors, uniqueAPIError));
    }
    Collections.sort(APIErrorsUnique);

Edit 2:

Sorry, I forgot to mention the desired output. Something like the following would be ideal.

Count, ErrorCode, ErrorDescription, list of files it occurs in (if possible)

Given your updated question:

    $ cat tst.awk
    BEGIN { OFS="," }
    match($0,/\s*<Error ErrorCode="([^"]+)" ErrorDescription="([^"]+)".*/,a) {
        code = a[1]
        desc[code] = a[2]
        count[code]++
        files[code][FILENAME]
    }
    END {
        print "Count","ErrorCode","ErrorDescription","List of files it occurs in"
        for (code in desc) {
            fnames = ""
            for (fname in files[code]) {
                fnames = (fnames ? fnames " " : "") fname
            }
            print count[code],code,desc[code],fnames
        }
    }
    $
    $ awk -f tst.awk file
    Count,ErrorCode,ErrorDescription,List of files it occurs in
    1,Code,Description,file

It still requires gawk 4.* for the third argument to match() and for 2D arrays, but both are easy to work around in any awk.
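Since the question also lists Python as an option, here is a rough Python counterpart of the gawk script, offered as a sketch rather than a drop-in replacement. It assumes, as the awk regex does, that ErrorCode and ErrorDescription sit on the same line separated by a single space. The names `summarize` and `to_csv_rows` are made up for illustration.

```python
import re
from collections import defaultdict

# Same idea as the awk regex: capture ErrorCode and ErrorDescription.
ERROR_RE = re.compile(r'<Error ErrorCode="([^"]+)" ErrorDescription="([^"]+)"')


def summarize(named_lines):
    """named_lines yields (filename, line) pairs.

    Returns {code: (count, description, sorted list of filenames)}.
    """
    count = defaultdict(int)
    desc = {}
    files = defaultdict(set)
    for filename, line in named_lines:
        m = ERROR_RE.search(line)
        if m:
            code, description = m.groups()
            count[code] += 1
            desc[code] = description      # last description seen wins, as in awk
            files[code].add(filename)     # a set keeps each filename once
    return {code: (count[code], desc[code], sorted(files[code])) for code in desc}


def to_csv_rows(summary):
    """Format the summary in the comma-separated layout the question asks for."""
    rows = ["Count,ErrorCode,ErrorDescription,List of files it occurs in"]
    for code, (n, description, names) in summary.items():
        rows.append(",".join([str(n), code, description, " ".join(names)]))
    return rows
```

Feeding it pairs such as `(path, line)` for every line of every log file and printing `to_csv_rows(...)` gives the same columns as the awk output. Descriptions that contain commas would need real CSV quoting, which this sketch skips.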

As requested in the comments, here is a non-gawk version:

    $ cat tst.awk
    BEGIN { OFS="," }
    /[[:space:]]*<Error / {
        # Build a name -> value map from every name="value" pair on the line.
        split("",n2v)
        while ( match($0,/[^[:space:]]+="[^"]+/) ) {
            name = value = substr($0,RSTART,RLENGTH)
            sub(/=.*/,"",name)
            sub(/^[^=]+="/,"",value)
            $0 = substr($0,RSTART+RLENGTH)
            n2v[name] = value
        }
        code = n2v["ErrorCode"]
        desc[code] = n2v["ErrorDescription"]
        count[code]++
        if (!seen[code,FILENAME]++) {
            fnames[code] = (code in fnames ? fnames[code] " " : "") FILENAME
        }
    }
    END {
        print "Count","ErrorCode","ErrorDescription","List of files it occurs in"
        for (code in desc) {
            print count[code],code,desc[code],fnames[code]
        }
    }
    $
    $ awk -f tst.awk file
    Count,ErrorCode,ErrorDescription,List of files it occurs in
    1,Code,Description,file

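There are several ways to do the above, some of them briefer, but when the input contains name=value pairs I like to create a name2value array (n2v[] is the name I usually give it) so I can access the values by their names. That makes the code easy to understand and easy to modify later, for example to add fields.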
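The name2value idea carries over directly to other languages. As a minimal Python sketch (the function name `name2value` and the exact regex are illustrative choices, not part of the answer), a dict plays the role of n2v[]:

```python
import re

# One name="value" pair. Like the awk loop, pairs with an empty value ("")
# do not match and are simply skipped.
PAIR_RE = re.compile(r'([^\s=]+)="([^"]+)"')


def name2value(line):
    """Return a dict mapping each attribute name on the line to its value."""
    return dict(PAIR_RE.findall(line))


n2v = name2value('<Error ErrorCode="Code" ErrorDescription="Description" ErrorInfo="" ErrorID="ID">')
print(n2v["ErrorCode"], n2v["ErrorDescription"])
```

Adding a field to the report later is then just one more lookup by name, for example `n2v.get("ErrorID")`.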

Here is my previous answer, since parts of it will be useful in other situations:

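You don't say what you want your output to look like, and the sample input you posted isn't really adequate to test with and show useful output, but this GNU awk script shows a way to get the count of whatever attribute name/value pairs you like: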

    $ cat tst.awk
    match($0,/\s*<Attribute name="([^"]+)" Value="([^"]+)".*/,a) {
        count[a[1]][a[2]]++
    }
    END {
        print "\nIf you just want to see the count of all error codes:"
        name = "ErrorCode"
        for (value in count[name]) {
            print name, value, count[name][value]
        }

        print "\nOr if theres a few specific attributes you care about:"
        split("ErrorID ErrorCode",names,/ /)
        for (i=1; i in names; i++) {
            name = names[i]
            for (value in count[name]) {
                print name, value, count[name][value]
            }
        }

        print "\nOr if you want to see the count of all values for all attributes:"
        for (name in count) {
            for (value in count[name]) {
                print name, value, count[name][value]
            }
        }
    }

    $ gawk -f tst.awk file

    If you just want to see the count of all error codes:
    ErrorCode Code 1

    Or if theres a few specific attributes you care about:
    ErrorID ID 1
    ErrorCode Code 1

    Or if you want to see the count of all values for all attributes:
    ErrorID ID 1
    ErrorDescription Description 1
    ErrorCode Code 1
    Number Number 1
    ProgID Prog 1
    UserID User 1
    Key Key 1

If your data is spread across multiple files, the script above doesn't care; just list them all on the command line:

gawk -f tst.awk file1 file2 file3 ...

It uses GNU awk 4.* for true multi-dimensional arrays, but there are trivial workarounds for any other awk if necessary.
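For comparison, the same count[name][value] structure can be sketched in Python with a dict of counters. This is an illustration under the assumption that attributes look exactly like the `<Attribute name="..." Value="..."/>` lines in the question; `count_attributes` is a made-up name.

```python
import re
from collections import Counter, defaultdict

ATTR_RE = re.compile(r'<Attribute name="([^"]+)" Value="([^"]+)"')


def count_attributes(lines):
    """Return {attribute name: Counter({value: occurrences})}."""
    counts = defaultdict(Counter)
    for line in lines:
        # findall picks up every attribute on the line, so this also works
        # when several attributes share one line.
        for name, value in ATTR_RE.findall(line):
            counts[name][value] += 1
    return counts
```

`counts["ErrorCode"]` then holds the tally of every error code, and iterating over `counts.items()` gives the "all values for all attributes" view. Unlike the awk `match()` call, which only looks at the first attribute on a line, this version counts all of them.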

One way to run an awk command on all files found recursively under a directory:

awk -f tst.awk $(find dir -type f -print)
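If the processing is done in Python instead, the recursive file lookup could be sketched with `pathlib`. The helper name `log_files` and the default `*.log` pattern are assumptions for illustration; adjust the pattern to your log naming.

```python
from pathlib import Path


def log_files(root, pattern="*.log"):
    """Return every regular file under root that matches pattern, sorted by path."""
    return sorted(p for p in Path(root).rglob(pattern) if p.is_file())
```

Because the paths are handled as objects rather than as a whitespace-split command line, file names containing spaces are not a problem here, which is a known weakness of the unquoted `$(find ...)` form.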

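Well, this isn't technically grep, but if you're open to other standard UNIX-esque commands, here is a one-liner that does the job, and it should be fast (I'd be interested to see the results on your dataset, actually):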

    sed -r -e '/Errors/,/<\/Errors>/!d' *.log -ne 's/.*<Error\s+ErrorCode="([^"]*)"\s+ErrorDescription="([^"]*)".*$/\1: \2/p' | sort | uniq -c | sort -nr

Assuming the files sort in date order, the *.log glob also takes care of the log-rolling problem (adjust it to match your log naming, of course).
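The reason the glob helps is that sed reads all the files as one continuous stream, so a block that starts in one file can end in the next. A minimal Python sketch of the same idea follows; `error_blocks` and `lines_of` are hypothetical helper names, and the sketch assumes the files are passed in chronological order.

```python
from itertools import chain


def error_blocks(lines):
    """Yield (block_lines, complete) for each <Errors>...</Errors> span in lines."""
    block = None
    for line in lines:
        if block is None and "<Errors>" in line:
            block = []                       # a new block starts on this line
        if block is not None:
            block.append(line.rstrip("\n"))
            if "</Errors>" in line:
                yield block, True            # closing tag seen: block is complete
                block = None
    if block is not None:
        # Input ended mid-block: the rest rolled into a file we were not given.
        yield block, False


def lines_of(paths):
    """Chain the lines of several files, in the given order, into one stream."""
    def read(path):
        with open(path, encoding="utf-8", errors="replace") as fh:
            yield from fh
    return chain.from_iterable(read(p) for p in paths)
```

Calling `error_blocks(lines_of(sorted_paths))` then treats a block split across two consecutive files as a single block, and the `complete` flag marks the one case that still cannot be resolved.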

Sample output

From my (dubious) test data based on yours:

    10 SomeOtherCode: This extended description
     4 Code: Description
     3 ReallyBadCode: disaster Description

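Brief explanation

Use sed to print only what lies between the selected addresses (here, lines).

Use sed again to filter those lines with a regex, replacing the header line with a made-up, unique-enough error string (including the description), similar to what your Java does (or at least the part of it we can see).

Sort and count these unique strings.

Present them in decreasing order of frequency.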
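The four steps above can be sketched in Python as well. This is an approximation of the pipeline, not a literal translation: it uses `<Errors>` as the start of the range and takes the first match on a line, and `error_frequencies` is a made-up name.

```python
import re
from collections import Counter

TITLE_RE = re.compile(r'<Error\s+ErrorCode="([^"]*)"\s+ErrorDescription="([^"]*)"')


def error_frequencies(lines):
    """Count "code: description" keys found inside <Errors> blocks.

    Returns (key, count) pairs in decreasing order of frequency.
    """
    counts = Counter()
    inside = False
    for line in lines:
        if "<Errors>" in line:            # start address of the range
            inside = True
        if inside:
            m = TITLE_RE.search(line)     # filter and rewrite, like the s///p step
            if m:
                counts[f"{m.group(1)}: {m.group(2)}"] += 1
            if "</Errors>" in line:       # end address of the range
                inside = False
    return counts.most_common()           # sort | uniq -c | sort -nr
```

`Counter` replaces the `sort | uniq -c` pair, and `most_common()` does the final ordering, so no intermediate sort of the raw lines is needed.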

I figure that since you mention Unix grep, you probably have Perl too. Here is a simple Perl solution:

    #!/usr/bin/perl
    my %countForErrorCode;
    while (<>) {
        /<Error ErrorCode="([^"]*)"/ && $countForErrorCode{$1}++
    }
    foreach my $e (keys %countForErrorCode) {
        print "$countForErrorCode{$e} $e\n"
    }

Assuming you are running *nix, save this Perl script, make it executable, and run it with a command like...

$ ./grepError.pl *.log

You should get something like...

    8 Code1
    203 Code2
    ...

where "Code1" and so on are the error codes captured between the double quotes in the regex.

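I worked this out on Windows using Cygwin. The solution assumes:

Your perl is located at /usr/bin/perl. You can verify this with $ which perl

The regex above, /<Error ErrorCode="([^"]*)"/, identifies what you want to count.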

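What the code is doing...

my %countForErrorCode declares a map (hash).

while (<>) iterates over each line of input and assigns the current line to the built-in variable $_.

/<Error ErrorCode="([^"]*)"/ implicitly tries to match against $_.

When a match occurs, the parentheses capture the value between the double quotes and assign the captured string to $1.

The regex "returns true" on a match, and only then is the count incremented by && $countForErrorCode{$1}++.

For the output, foreach my $e (keys %countForErrorCode) iterates over the captured error codes, and print "$countForErrorCode{$e} $e\n" prints the count and the code on one line.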
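For readers more at home in Python, the same walk-through maps almost line for line onto the standard library. The sketch below is an illustration, with `count_error_codes` as a made-up name: `fileinput.input()` stands in for Perl's `<>`, and a `Counter` stands in for the hash.

```python
import fileinput
import re
from collections import Counter

CODE_RE = re.compile(r'<Error ErrorCode="([^"]*)"')


def count_error_codes(paths):
    """Count ErrorCode values across the given files, one line at a time."""
    counts = Counter()
    with fileinput.input(files=paths) as stream:   # plays the role of Perl's <>
        for line in stream:
            m = CODE_RE.search(line)
            if m:                                  # increment only on a match
                counts[m.group(1)] += 1            # group(1) corresponds to $1
    return counts
```

Printing `f"{n} {code}"` for each pair in `counts.items()` reproduces the "count, then code" lines of the Perl output. Like the Perl script, it reads one line at a time, so memory use stays flat even for very large logs.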

Edit: more detailed output, per the updated spec

    #!/usr/bin/perl
    my %dataForError;
    while (<>) {
        if (/<Error ErrorCode="([^"]+)"\s*ErrorDescription="([^"]+)"/) {
            if (! $dataForError{$1}) {
                $dataForError{$1} = {};
                $dataForError{$1}{'desc'} = $2;
                $dataForError{$1}{'files'} = {};
            }
            $dataForError{$1}{'count'}++;
            $dataForError{$1}{'files'}{$ARGV}++;
        }
    }
    my @out;
    foreach my $e (keys %dataForError) {
        # Dereference the hashref explicitly; recent perls reject keys on a bare reference.
        my $files = join("\n\t", keys %{ $dataForError{$e}{'files'} });
        my $out = "$dataForError{$e}{'count'},$e,'$dataForError{$e}{'desc'}'\n\t$files\n";
        push @out, $out;
    }
    print @out;

To take input files recursively, as in what you posted above, you could run the script like this:

$ find . -name "*.log" | xargs grepError.pl

and it produces output like this:

    8,Code2,'bang'
        ./today.log
    48,Code4,'oops'
        ./2015/jan/yesterday.log
    2,Code1,'foobar'
        ./2014/dec/someday.log

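Explanation:

The script maps each unique error code to a hash that tracks the count, the description, and the unique file names in which the error code was found.

Perl automatically stores the name of the current input file in $ARGV.

The script counts the occurrences per unique file name, but does not output those counts.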
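If the per-file counts are wanted after all, the data is already there and only the output needs to change. As a hedged Python sketch of the same structure (the names `collect` and `report`, and the `(n)` suffix in the output, are illustrative choices and not something the answer specifies):

```python
import re
from collections import Counter

ERROR_RE = re.compile(r'<Error ErrorCode="([^"]+)"\s*ErrorDescription="([^"]+)"')


def collect(named_lines):
    """named_lines yields (filename, line) pairs.

    Returns {code: {"desc": str, "files": Counter({filename: occurrences})}}.
    """
    data = {}
    for filename, line in named_lines:
        m = ERROR_RE.search(line)
        if m:
            code, desc = m.groups()
            # The first description seen for a code is kept, as in the Perl script.
            entry = data.setdefault(code, {"desc": desc, "files": Counter()})
            entry["files"][filename] += 1
    return data


def report(data):
    """One header line per code, then one tab-indented line per file with its count."""
    out = []
    for code, entry in data.items():
        total = sum(entry["files"].values())
        out.append(f"{total},{code},'{entry['desc']}'")
        for filename, n in sorted(entry["files"].items()):
            out.append(f"\t{filename} ({n})")
    return out
```

The total per code is derived from the per-file counters here, so there is no separate count field to keep in sync.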


Original URL: https://outofmemory.cn/langs/1154983.html
