您可以将表格转换为第一范式,然后比较存储在每一行中的化合物。起点可以是:
{1}对每行进行标记,然后将标记写入新表。给每个令牌其原始ID 加上
3个字母的前缀,以指示该令牌来自哪个表。{2}按ID对新(“规范化”)表的行进行分组,并执行LISTAGG()。执行自我连接,并找到匹配的“令牌组”。
{1}标记化,创建表为select(CTAS)
create table tokensas select ltrim( -- ltrim() and rtrim() remove leading/trailing spaces (blanks) rtrim( substr( N.wrapped , instr( N.wrapped, ',', 1, T.pos ) + 1 , ( instr( N.wrapped, ',', 1, T.pos + 1 ) - instr( N.wrapped, ',', 1, T.pos ) ) - 1 ) ) ) token, N.idfrom ( select ',' || name1 || ',' as wrapped, 'T1_' || to_char( id_t1 ) as id from t1 -- names wrapped in commas, (table)_id union all select ',' || name2 || ',' , 'T2_' || to_char( id_t2 ) from t2 ) N join ( select level as pos -- (max) possible position of char in an existing token from dual connect by level <= ( select greatest( -- find the longest string ie max position (query T1 and T2) ( select max( length( name1 ) ) from t1 ) , ( select max( length( name2 ) ) from t2 ) ) as pos from dual ) ) T on T.pos <= ( length( N.wrapped ) - length( replace( N.wrapped, ',') ) ) - 1 ;
不使用ConNECT BY进行标记化的灵感来自此SO答案。
不使用
connect by:
WITH CTE AS (SELECT 'a,b,c,d,e' temp,1 slno FROM DUAL UNIOn SELECt 'f,g',2 from dual UNIOn SELECt 'h',3 FROM DUAL),x as ( select ','||temp||',' temp ,slno from CTE),iter as (SELECt rownum AS pos FROM all_objects)selectSUBSTr(x.temp ,INSTR(x.temp, ',', 1, iter.pos) + 1 ,INSTR(x.temp, ',', 1, iter.pos + 1)-INSTR(x.temp, ',', 1, iter.pos)-1) temp,x.slnofrom x, iterwhere iter.pos < = (LENGTH(x.temp) - LENGTH(REPLACe(x.temp, ','))) - 1;
TOKENS表的内容如下所示:
SQL> select * from tokens ;TOKEN ID ASCORBIC ACID T1_1 SODIUM HYDROGEN CARBonATE T1_2 CAFFEINE T1_3 PSEUDOEPHEDRINE HYDROCHLORIDE T1_4 PARACETAMOL T1_100 sodium hydroxide T1_110 POTASSIUM HYDROGEN CARBonATE T2_4 SODIUM HYDROGEN CARBonATE T2_5 PARACETAMOL PH. EUR. T2_6 CODEINE PHOSPHATE T2_7 DEXCHLORPHENIRAMINE MALEATE T2_8 DEXCHLORPHENIRAMINE MALEATE T2_10 PARACETAMOL T2_200 ...
{2} GROUP BY,LISTAGG,自我加入
select S1.id id1, S2.id id2, S1.tokengroup_T1, S2.tokengroup_T2from ( select substr( id, 4, length( id ) - 3 ) id , listagg( token, ' + ' ) within group ( order by token ) tokengroup_T1 from tokens group by id having substr( id, 1, 3 ) = 'T1_') S1 join ( select substr( id, 4, length( id ) - 3 ) id , listagg( token, ' + ' ) within group ( order by token ) tokengroup_T2 from tokens group by id having substr( id, 1, 3 ) = 'T2_') S2 on S1.tokengroup_T1 = S2.tokengroup_T2;-- resultID1 ID2 TOKENGROUP_T1 TOKENGROUP_T2 4 10 DEXCHLORPHENIRAMINE MALEATE + PSEUDOEPHEDRINE HYDROCHLORIDE DEXCHLORPHENIRAMINE MALEATE + PSEUDOEPHEDRINE HYDROCHLORIDE 110 210 potassium carbonate + sodium hydroxide potassium carbonate + sodium hydroxide 1 4 ASCORBIC ACID + PARACETAMOL + POTASSIUM HYDROGEN CARBonATE ASCORBIC ACID + PARACETAMOL + POTASSIUM HYDROGEN CARBonATE 3 6 CAFFEINE + PARACETAMOL PH. EUR. CAFFEINE + PARACETAMOL PH. EUR.
以这种方式执行 *** 作时,您可以使物质按字母顺序排列,并且还可以在此处选择所需的“定界符”(我们使用了“ +”)。
选择
如果这对您没有用,或者您认为这太复杂了,则可以尝试使用TRANSLATE()。在这种情况下,建议您从数据集中删除所有空格/空格(在查询中- 请勿
更改原始数据!),如下所示:
询问
select id1, id2, name1, name2from ( select id_t1 id1 , id_t2 id2 , T1.name1 name1 , T2.name2 name2 from T1 join T2 on translate( replace( T1.name1, ' ', '' ), replace( T2.name2, ' ', '' ), '!' ) = translate( replace( T2.name2, ' ', '' ), replace( T1.name1, ' ', '' ), '!' )) ;
结果
ID1 ID2 NAME1 NAME2 2 5 SODIUM HYDROGEN CARBONATE, SODIUM CARBonATE ANHYDROUS, CITRIC ACID SODIUM HYDROGEN CARBONATE, SODIUM CARBonATE ANHYDROUS 3 6 CAFFEINE, PARACETAMOL PH. EUR. PARACETAMOL PH. EUR.,CAFFEINE 100 10 PARACETAMOL, DEXTROMETHORPHAN, PSEUDOEPHEDRINE, PYRILAMINEDEXCHLORPHENIRAMINE MALEATE, PSEUDOEPHEDRINE HYDROCHLORIDE 110 210 sodium hydroxide, potassium carbonate sodium hydroxide, potassium carbonate
注意: 我已将以下行添加到您的示例数据:
-- T1110, 'sodium hydroxide, potassium carbonate'-- T2210, 'sodium hydroxide, potassium carbonate' 211, 'potassium hydroxide, sodium carbonate'
我发现很容易使用TRANSLATE()来给您“假阳性”,即ID为110、210和211的物质看起来会“匹配”。(换句话说:我认为这不是这项工作的正确工具。)
DBFIDDLE在这里
(点击链接以查看示例表和查询)。
欢迎分享,转载请注明来源:内存溢出
评论列表(0条)