大模式的隐藏单词统计数据

论文标题

大模式的隐藏单词统计数据

Hidden Words Statistics for Large Patterns

论文作者

Janson, Svante, Szpankowski, Wojciech

论文摘要

我们在这里研究所谓的子序列模式匹配，也称为隐藏模式匹配，其中一个人在$ n $长度的随机文本中搜索给定的图案$ w $的长度$ m $。兴趣数量是$ w $作为子序列的发生数量（即，不一定是连续的文本位置）。这个问题发现了许多应用程序，从入侵检测到追踪重建，删除通道和基于DNA的存储系统。在所有这些应用中，模式$ w $的长度都是可变的。据我们所知，这个问题仅以固定的长度$ m = O（1）$ [Flajolet，Szpankowski和Vallée，2006年]解决。在我们的主要结果中，我们证明，对于$ m = o（n^{1/3}）$，子序列的数量正态分布。此外，我们表明，在对$ w $结构的某些约束下，渐近正态性可以扩展到$ m = o（\ sqrt {n}）$。对于由相同符号组成的特殊模式$ w $，我们指出，对于$ m = o（n）$，子序列数的分布在渐近正常或渐近日志上是正常的。我们猜想，这种二分法对于所有模式都是正确的。我们使用Hoeffding的$ U $统计数据投影方法来证明我们的发现。

We study here the so called subsequence pattern matching also known as hidden pattern matching in which one searches for a given pattern $w$ of length $m$ as a subsequence in a random text of length $n$. The quantity of interest is the number of occurrences of $w$ as a subsequence (i.e., occurring in not necessarily consecutive text locations). This problem finds many applications from intrusion detection, to trace reconstruction, to deletion channel, and to DNA-based storage systems. In all of these applications, the pattern $w$ is of variable length. To the best of our knowledge this problem was only tackled for a fixed length $m=O(1)$ [Flajolet, Szpankowski and Vallée, 2006]. In our main result we prove that for $m=o(n^{1/3})$ the number of subsequence occurrences is normally distributed. In addition, we show that under some constraints on the structure of $w$ the asymptotic normality can be extended to $m=o(\sqrt{n})$. For a special pattern $w$ consisting of the same symbol, we indicate that for $m=o(n)$ the distribution of number of subsequences is either asymptotically normal or asymptotically log normal. We conjecture that this dichotomy is true for all patterns. We use Hoeffding's projection method for $U$-statistics to prove our findings.

下载PDF全文

下载文献需遵守相关版权规定

论文标题