Replace URLs in text with HTML links

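Let's look at the requirements. You have some user-supplied plain text, which you want to display with its URLs turned into hyperlinks.

The "http://" protocol prefix should be optional.
Both domain names and IP addresses should be accepted.
Any valid top-level domain should be accepted, e.g. .aero and .xn--jxalpdlp.
Port numbers should be allowed.
URLs must be allowed in normal sentence contexts (for example, a period that ends the sentence is not part of the URL).
You probably want to allow "https://" URLs too, and perhaps other schemes as well.
As always when displaying user-supplied text in HTML, you want to prevent cross-site scripting (XSS). You will also want ampersands in URLs to be correctly escaped as &amp;.
You probably don't need support for IPv6 addresses.
Edit: As noted in the comments, support for email addresses is definitely desirable.
Edit: Only plain-text input is supported; HTML tags in the input are not preserved. (The Bitbucket version supports HTML input.)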
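The escaping requirement can be illustrated with a small sketch. It assumes PHP's built-in htmlspecialchars; the two sample strings are made up for illustration and are not part of the original answer.

```php
<?php
// Hypothetical user input, used only for illustration.
$url    = 'http://www.google.com/search?rls=en&q=42';
$attack = "<script>alert('xss');</script>";

// ENT_QUOTES also escapes single quotes, which matters inside attributes.
$safeUrl    = htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
$safeAttack = htmlspecialchars($attack, ENT_QUOTES, 'UTF-8');

// The ampersand in the query string becomes &amp; in both the href and the label.
print '<a href="' . $safeUrl . '">' . $safeUrl . "</a>\n";

// The script tag is rendered as harmless text instead of being executed.
print $safeAttack . "\n";
```

Escaping each piece at the moment it is written into the HTML, rather than escaping the whole text up front, keeps the URL detection working on the raw characters.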

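Edit: Check out GitHub for the latest version, with support for email addresses, authenticated URLs, URLs in quotes and parentheses, HTML input, and an updated TLD list.

Here's my take: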

```php
<?php
$text = <<<EOD
Here are some URLs:
stackoverflow.com/questions/1188129/pregreplace-to-detect-html-php
Here's the answer: http://www.google.com/search?rls=en&q=42&ie=utf-8&oe=utf-8&hl=en. What was the question?
A quick look at http://en.wikipedia.org/wiki/URI_scheme#Generic_syntax is helpful.
There is no place like 127.0.0.1! Except maybe http://news.bbc.co.uk/1/hi/england/surrey/8168892.stm?
Ports: 192.168.0.1:8080, https://example.net:1234/.
Beware of Greeks bringing internationalized top-level domains: xn--hxajbheg2az3al.xn--jxalpdlp.
And remember.Nobody is perfect.
<script>alert('Remember kids: Say no to XSS-attacks! Always HTML escape untrusted input!');</script>
EOD;

// Building blocks of the URL pattern; each one is a capture group.
$rexProtocol = '(https?://)?';
$rexDomain   = '((?:[-a-zA-Z0-9]{1,63}\.)+[-a-zA-Z0-9]{2,63}|(?:[0-9]{1,3}\.){3}[0-9]{1,3})';
$rexPort     = '(:[0-9]{1,5})?';
$rexPath     = '(/[!$-/0-9:;=@_\':;!a-zA-Z\x7f-\xff]*?)?';
$rexQuery    = '(\?[!$-/0-9:;=@_\':;!a-zA-Z\x7f-\xff]+?)?';
$rexFragment = '(#[!$-/0-9:;=@_\':;!a-zA-Z\x7f-\xff]+?)?';

// Solution 1:

function callback($match)
{
    // Prepend http:// if no protocol specified
    $completeUrl = $match[1] ? $match[0] : "http://{$match[0]}";

    // Link text is domain + port + path (query and fragment are left out).
    return '<a href="' . $completeUrl . '">'
        . $match[2] . $match[3] . $match[4] . '</a>';
}

print "<pre>";
// The trailing lookahead keeps sentence punctuation out of the match.
print preg_replace_callback("&\\b$rexProtocol$rexDomain$rexPort$rexPath$rexQuery$rexFragment(?=[?.!,;:\"]?(\\s|$))&",
    'callback', htmlspecialchars($text));
print "</pre>";
```
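The trailing lookahead is what lets a URL sit inside an ordinary sentence. The following is a minimal sketch of that idea using a deliberately simplified pattern (domain plus optional path only); the variable names and sample sentences are assumptions made for illustration.

```php
<?php
// Simplified stand-in for the full pattern: a domain, an optional path, and
// the same kind of trailing lookahead. The path is lazy, so punctuation that
// is followed by whitespace (or the end of the text) stays outside the match.
$pattern = '~\b(?:[-a-zA-Z0-9]{1,63}\.)+[-a-zA-Z0-9]{2,63}(?:/[-a-zA-Z0-9/._]*?)?(?=[?.!,;:"]?(?:\s|$))~';

preg_match($pattern, 'Visit stackoverflow.com/questions. Then come back.', $m1);
preg_match($pattern, 'Is it example.org?', $m2);

print $m1[0] . "\n"; // stackoverflow.com/questions
print $m2[0] . "\n"; // example.org
```

Because the lookahead does not consume anything, the period or question mark remains in the surrounding text and is printed after the link.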

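To properly escape < and & characters, I pass the whole text through htmlspecialchars before processing. This is not ideal, because the HTML escaping can cause URL boundaries to be detected incorrectly.
As the "And remember.Nobody is perfect." line demonstrates (remember.Nobody is treated as a URL because of the missing space), further checking against valid top-level domains may be in order.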
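Both problems can be reproduced in a few lines. The sketch below repeats the pattern building blocks so that it runs on its own; the helper firstUrl and the quoted sample sentence are hypothetical additions for illustration.

```php
<?php
// Same building blocks as in Solution 1, repeated so this snippet is standalone.
$rexProtocol = '(https?://)?';
$rexDomain   = '((?:[-a-zA-Z0-9]{1,63}\.)+[-a-zA-Z0-9]{2,63}|(?:[0-9]{1,3}\.){3}[0-9]{1,3})';
$rexPort     = '(:[0-9]{1,5})?';
$rexPath     = '(/[!$-/0-9:;=@_\':;!a-zA-Z\x7f-\xff]*?)?';
$rexQuery    = '(\?[!$-/0-9:;=@_\':;!a-zA-Z\x7f-\xff]+?)?';
$rexFragment = '(#[!$-/0-9:;=@_\':;!a-zA-Z\x7f-\xff]+?)?';
$pattern = "&\\b$rexProtocol$rexDomain$rexPort$rexPath$rexQuery$rexFragment(?=[?.!,;:\"]?(\\s|$))&";

// Hypothetical helper: return the first URL-like match, or null if none.
function firstUrl($pattern, $text)
{
    return preg_match($pattern, $text, $m) ? $m[0] : null;
}

$raw = 'See "http://example.com/a" now';

// Problem 1: escaping first turns the closing quote into &quot;, and part of
// that entity is swallowed by the path.
print firstUrl($pattern, $raw) . "\n";                   // http://example.com/a
print firstUrl($pattern, htmlspecialchars($raw)) . "\n"; // http://example.com/a&quot

// Problem 2: without a TLD check, a missing space produces a false positive.
print firstUrl($pattern, 'And remember.Nobody is perfect.') . "\n"; // remember.Nobody
```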

Edit: The following code fixes both of the problems above, but it is considerably more verbose, since I am more or less re-implementing preg_replace_callback using preg_match.

```php
// Solution 2:
// Reuses $text and the $rex* building blocks defined for Solution 1.

$validTlds = array_fill_keys(explode(" ", ".aero .asia .biz .cat .com .coop .edu .gov .info .int .jobs .mil .mobi .museum .name .net .org .pro .tel .travel .ac .ad .ae .af .ag .ai .al .am .an .ao .aq .ar .as .at .au .aw .ax .az .ba .bb .bd .be .bf .bg .bh .bi .bj .bm .bn .bo .br .bs .bt .bv .bw .by .bz .ca .cc .cd .cf .cg .ch .ci .ck .cl .cm .cn .co .cr .cu .cv .cx .cy .cz .de .dj .dk .dm .do .dz .ec .ee .eg .er .es .et .eu .fi .fj .fk .fm .fo .fr .ga .gb .gd .ge .gf .gg .gh .gi .gl .gm .gn .gp .gq .gr .gs .gt .gu .gw .gy .hk .hm .hn .hr .ht .hu .id .ie .il .im .in .io .iq .ir .is .it .je .jm .jo .jp .ke .kg .kh .ki .km .kn .kp .kr .kw .ky .kz .la .lb .lc .li .lk .lr .ls .lt .lu .lv .ly .ma .mc .md .me .mg .mh .mk .ml .mm .mn .mo .mp .mq .mr .ms .mt .mu .mv .mw .mx .my .mz .na .nc .ne .nf .ng .ni .nl .no .np .nr .nu .nz .om .pa .pe .pf .pg .ph .pk .pl .pm .pn .pr .ps .pt .pw .py .qa .re .ro .rs .ru .rw .sa .sb .sc .sd .se .sg .sh .si .sj .sk .sl .sm .sn .so .sr .st .su .sv .sy .sz .tc .td .tf .tg .th .tj .tk .tl .tm .tn .to .tp .tr .tt .tv .tw .tz .ua .ug .uk .us .uy .uz .va .vc .ve .vg .vi .vn .vu .wf .ws .ye .yt .yu .za .zm .zw .xn--0zwm56d .xn--11b5bs3a9aj6g .xn--80akhbyknj4f .xn--9t4b11yi5a .xn--deba0ad .xn--g6w251d .xn--hgbk6aj7f53bba .xn--hlcj6aya9esc7a .xn--jxalpdlp .xn--kgbechtv .xn--zckzah .arpa"), true);

$position = 0;
// $match is passed normally; preg_match already takes it by reference.
while (preg_match("{\\b$rexProtocol$rexDomain$rexPort$rexPath$rexQuery$rexFragment(?=[?.!,;:\"]?(\\s|$))}", $text, $match, PREG_OFFSET_CAPTURE, $position))
{
    list($url, $urlPosition) = $match[0];

    // Print the text leading up to the URL.
    print(htmlspecialchars(substr($text, $position, $urlPosition - $position)));

    $domain = $match[2][0];
    $port   = $match[3][0];
    $path   = $match[4][0];

    // Check if the TLD is valid - or that $domain is an IP address.
    $tld = strtolower(strrchr($domain, '.'));
    if (preg_match('{\.[0-9]{1,3}}', $tld) || isset($validTlds[$tld]))
    {
        // Prepend http:// if no protocol specified
        $completeUrl = $match[1][0] ? $url : "http://$url";

        // Print the hyperlink.
        printf('<a href="%s">%s</a>', htmlspecialchars($completeUrl), htmlspecialchars("$domain$port$path"));
    }
    else
    {
        // Not a valid URL.
        print(htmlspecialchars($url));
    }

    // Continue text parsing from after the URL.
    $position = $urlPosition + strlen($url);
}

// Print the remainder of the text.
print(htmlspecialchars(substr($text, $position)));
```
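The TLD check can also be looked at in isolation. The sketch below is a hedged illustration, not the author's code: the helper name looksLikeHost and the deliberately tiny TLD list are assumptions, and the numeric test is anchored here, which is slightly stricter than in the listing above.

```php
<?php
// Hypothetical, deliberately tiny TLD list; a real list would be far longer.
$validTlds = array_fill_keys(array('.com', '.net', '.org', '.uk'), true);

// Hypothetical helper: accept a host if its last label is a known TLD or is
// purely numeric (which is how the last part of an IPv4 address looks).
function looksLikeHost($domain, array $validTlds)
{
    // strrchr returns everything from the last dot onwards, or false.
    $tld = strtolower((string) strrchr($domain, '.'));

    return preg_match('~^\.[0-9]{1,3}$~', $tld) === 1 || isset($validTlds[$tld]);
}

var_dump(looksLikeHost('stackoverflow.com', $validTlds)); // bool(true)
var_dump(looksLikeHost('news.bbc.co.UK', $validTlds));    // bool(true)
var_dump(looksLikeHost('127.0.0.1', $validTlds));         // bool(true)
var_dump(looksLikeHost('remember.Nobody', $validTlds));   // bool(false)
```

Storing the TLDs as array keys makes the lookup a single isset call, which is the same design choice as array_fill_keys in Solution 2.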

Feel free to share; when reposting, please credit the source: 内存溢出

Original article: http://outofmemory.cn/zaji/5009520.html
