SCU - 4438 Censor (Hashing)

white_156

Published 2019-04-27 19:49:36

License: CC 4.0 BY-SA

Category: String Processing. Tags: Hashing

Article link: https://blog.csdn.net/white_156/article/details/89503057

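This article presents an efficient way to filter sensitive words using string hashing. Converting strings to hash values makes it possible to test two strings for equality quickly, which in turn lets us locate and remove sensitive words in a long text. It explains how the hash value is computed and how to query the hash of a substring, and finishes with an example that removes a given sensitive word from a text.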

frog is now an editor to censor so-called sensitive words.
She has a long text p. Her job is relatively simple – just to find the first occurrence of sensitive word w and remove it.
frog repeats over and over again. Help her do the tedious work.

Input
The input consists of multiple tests. For each test:
The first line contains 1 string w. The second line contains 1 string p. (1 ≤ length of w, p ≤ 5⋅10⁶; w and p consist of only lowercase letters)

Output
For each test, write 1 string which denotes the censored text.

Sample Input

abc
aaabcbc
b
bbb
abc
ab

Sample Output

a

ab

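Checking two strings for equality requires comparing them character by character. If each string could be mapped to a unique number, comparing two strings would reduce to comparing their mapped values.
Much like binary notation, we can assign a weight to each position of the string; multiplying each letter by the weight of its position and summing the results gives a unique code for the string. The problem is that for a very long string the code grows too large to store or to compute with.
Fortunately we can take a remainder: after reducing the code modulo some number, the resulting values are still different with high probability, so the remainder can serve as the hash value.
That is why we use unsigned long long to store hash values here: its arithmetic wraps around modulo 2⁶⁴ automatically.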
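Here we write the hash of a string $S$ of length $n$ in the following form:
$Hash[n] = S[1] \cdot base^n + S[2] \cdot base^{n-1} + \cdots + S[n] \cdot base$
(The code at the end builds the same polynomial with exponents $n-1$ down to $0$, which differs only by a constant factor of $base$.)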

base should preferably be a prime, so that the values are more likely to stay distinct after taking the remainder.
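As a minimal sketch of the hash described above, the function below evaluates the polynomial with Horner's rule. The names `polyHash` and `BASE` are made up for this illustration; the mapping 'a' → 0 and the base value are taken from the solution code at the end of the post. The exponents run from n-1 down to 0, as in that code.

```cpp
#include <cassert>
#include <string>

typedef unsigned long long ULL;

const ULL BASE = 1000000007ULL;  // prime base, same value as in the solution

// Polynomial hash of s via Horner's rule; arithmetic wraps modulo 2^64.
// Result: s[0]*BASE^(n-1) + s[1]*BASE^(n-2) + ... + s[n-1].
ULL polyHash(const std::string& s) {
    ULL h = 0;
    for (char c : s) {
        assert(c >= 'a' && c <= 'z');  // assumption: lowercase input only
        h = h * BASE + (ULL)(c - 'a'); // 'a' maps to 0, 'b' to 1, ...
    }
    return h;
}
```

Because 'a' maps to 0, strings of different lengths can share a hash (for example "bc" and "abc"). The hashes should therefore only be compared for strings of equal length, which is all this problem needs.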

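Now for querying the hash of a substring. To look up the string in the interval $[l, r]$, we need its hash value
$H(l, r) = S[l] \cdot base^{r-l+1} + S[l+1] \cdot base^{r-l} + \cdots + S[r] \cdot base$
With $Hash[i]$ denoting the hash of the first $i$ characters, it is not hard to see that
$H(l, r) = Hash[r] - Hash[l-1] \cdot base^{r-l+1}$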
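The substring query can be sketched as follows. `PrefixHash` and its members are hypothetical names used only for this illustration; positions are 1-indexed as in the formulas, and the digit mapping 'a' → 0 is assumed from the solution code.

```cpp
#include <cassert>
#include <string>
#include <vector>

typedef unsigned long long ULL;

const ULL BASE = 1000000007ULL;  // prime base; hashes wrap modulo 2^64

// Prefix hashes of a lowercase string, with 1-indexed positions.
struct PrefixHash {
    std::vector<ULL> h;   // h[i]  = hash of the first i characters
    std::vector<ULL> pw;  // pw[i] = BASE^i (mod 2^64)

    explicit PrefixHash(const std::string& s)
        : h(s.size() + 1, 0), pw(s.size() + 1, 1) {
        for (size_t i = 1; i <= s.size(); i++) {
            h[i] = h[i - 1] * BASE + (ULL)(s[i - 1] - 'a');
            pw[i] = pw[i - 1] * BASE;
        }
    }

    // Hash of the substring at positions l..r (1-indexed, inclusive).
    ULL get(size_t l, size_t r) const {
        assert(l >= 1 && l <= r && r < h.size());
        return h[r] - h[l - 1] * pw[r - l + 1];
    }
};
```

For the text "aaabcbc", `get(3, 5)` covers "abc" and equals the hash of the whole word "abc". The subtraction may wrap around, but unsigned arithmetic modulo 2⁶⁴ still gives the right value.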

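Finally, the problem itself. Every occurrence of string A has to be removed from B, so we take characters from the left end of B one at a time and append them to a string C. C never contains A, so after appending a character the only place A can appear is the suffix of C of length A.len. If that suffix equals A, delete it from C and move on to the next character; otherwise the new character simply stays in C.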
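To make the procedure concrete, here is a small sketch that compares the suffix directly instead of hashing it. It is slower on worst-case input and is meant only to show the append-and-delete logic. The function name `censor` is invented for this example.

```cpp
#include <cassert>
#include <string>

// Remove occurrences of w from p repeatedly, scanning left to right.
// Uses direct suffix comparison (not hashing), for illustration only.
std::string censor(const std::string& w, const std::string& p) {
    assert(!w.empty());
    std::string c;  // text kept so far; never contains w
    for (char ch : p) {
        c.push_back(ch);
        if (c.size() >= w.size() &&
            c.compare(c.size() - w.size(), w.size(), w) == 0) {
            c.erase(c.size() - w.size());  // drop the matching suffix
        }
    }
    return c;
}
```

On the first sample, "aaabcbc" becomes "aa" after the first "abc" is deleted, then "aabc" two characters later, which is cut down to "a". The full solution below does the same thing but replaces the suffix comparison with a constant-time hash comparison.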

#include <cstdio>
#include <cstring>

typedef unsigned long long ULL;

const int maxn = 5e6 + 5;
const ULL base = 1e9 + 7;  // prime base; hashes wrap modulo 2^64

ULL p[maxn];     // p[i] = base^i
ULL Hash[maxn];  // Hash[i] = hash of the first i characters kept in ans

char A[maxn], B[maxn];
char ans[maxn];  // censored text, 1-indexed

// Hash of ans[l..r], computed from the prefix hashes.
inline ULL getSection(int r, int l) {
    return Hash[r] - Hash[l - 1] * p[r + 1 - l];
}

int main() {
    p[0] = 1;
    for (int i = 1; i < maxn; i++)
        p[i] = p[i - 1] * base;

    // Stop as soon as a full test case (two strings) can no longer be read.
    while (scanf("%s%s", A, B + 1) == 2) {
        int lenA = strlen(A), lenB = strlen(B + 1);

        // Hash of the sensitive word A.
        ULL ha = 0;
        for (int i = 0; i < lenA; i++)
            ha = ha * base + (A[i] - 'a');

        Hash[0] = 0;  // hash of the empty prefix; no need to clear the whole array
        int pos = 1;  // next free slot in ans
        for (int i = 1; i <= lenB; i++) {
            ans[pos] = B[i];
            Hash[pos] = Hash[pos - 1] * base + (B[i] - 'a');

            // If the last lenA kept characters equal A, drop them.
            if (pos >= lenA && getSection(pos, pos - lenA + 1) == ha) {
                pos -= lenA;
            }
            pos++;
        }

        // Print the kept characters ans[1..pos-1] in one call.
        ans[pos] = '\0';
        puts(ans + 1);
    }
    return 0;
}