FDL#2

This document is a fragment of a preprint. If you're interested, you can always ask me for the latest copy.

A. Information Theory of FDL

FDLs are algebraic lattices, so given some elements from the same FDL, we can always find a lower bound and an upper bound. If we only care about the upper bound, we can monotonically refine the upper bound with reducible operations, which is the prominent application of most FDLs and also the central thesis of this presentation.
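
As a concrete sketch (my own illustration, not taken from the preprint), a fixed-width bit vector under bitwise OR forms such a join-semilattice: the operation is associative, commutative, and idempotent, so repeated merging can only refine the upper bound monotonically.

```python
# A minimal join-semilattice sketch: bit vectors under bitwise OR.
# The name `join` is illustrative, not from the original text.

def join(x: int, y: int) -> int:
    """The least upper bound of two bit vectors is their bitwise OR."""
    return x | y

a, b, c = 0b0011, 0b0101, 0b1000

# Associative, commutative, idempotent -- the semilattice laws.
assert join(join(a, b), c) == join(a, join(b, c))
assert join(a, b) == join(b, a)
assert join(a, a) == a

# Monotone refinement: each merge can only move the state upward.
state = 0          # identity element: no evidence yet
for evidence in (a, b, c):
    new_state = join(state, evidence)
    assert join(new_state, state) == new_state   # state <= new_state
    state = new_state
print(bin(state))  # 0b1111
```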

A fresh instance of any FDL should start from the state of its identity element, which gives the insignificant information[A1] that this vessel has no evidence so far. As non-trivial elements are absorbed by the instance, it accumulates relevant evidence about the upper bound, until it stops at the absorbing element. Under the designated operation, such an instance can be altered only by non-trivial evidence, so we can (and have to) ignore all possible trivial evidence, which is equivalent to the identity element. It's safe to say that the identity element is exactly trivial evidence, which can be reduced unconditionally. For any state of the given FDL, any element smaller than it provides no relevant evidence and can be reduced conditionally. In other words, the partial order in the corresponding lattice describes the relevance of evidence. The supremum of some elements in a given FDL retains all the relevant evidence[A2], so that we can reduce all subsequent operations associated with them.
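
The three special roles above can be demonstrated directly. This is an assumed 4-bit vector example of my own; the names `IDENTITY` and `ABSORBING` are illustrative.

```python
# Illustration (assumed 4-bit vector FDL): identity element, absorbing
# element, and the "relevance" partial order described above.

WIDTH = 4
IDENTITY = 0                    # trivial evidence: changes nothing
ABSORBING = (1 << WIDTH) - 1    # 0b1111: no further evidence can alter it

def join(x, y):
    return x | y

def leq(x, y):
    """x <= y in the lattice order iff x adds no evidence beyond y."""
    return join(x, y) == y

state = 0b0110

# The identity element is exactly trivial evidence.
assert join(state, IDENTITY) == state

# Any element below the current state is irrelevant and can be reduced.
assert leq(0b0100, state) and join(state, 0b0100) == state

# The absorbing element ends accumulation: the instance becomes immutable.
assert join(ABSORBING, 0b1010) == ABSORBING
```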

I deliberately wrote "evidence" rather than "information", partly because I'm a faithful Bayesian. Elements in any given FDL are just symbols, and they can have equal information entropy or otherwise, depending on the theoretical or practical distribution under certain assumptions. Information is always relative and context-sensitive. For a bit vector, an absorbing element might be very useful, while a similar Bloom Filter is pretty much useless. Nonetheless, all FDL instances in absorbing states are immutable under their primary operation and cannot accommodate new evidence[A3].

For a given FDL, any given state x of its instances tells us information about the accumulated evidence, i.e. the supremum of the representing elements, and we can also assert that any observed evidence must be represented by a certain element between the identity element and x, or [one, x]. Less obviously, it also tells us information about all possible relevant evidence which may augment its further states, which can be enumerated as all complementary elements, or ~[one, x]. Finally, its subsequent states can be predicted as elements between x and the absorbing element, or [x, zero].
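
The three element sets can be enumerated for a small example. This is my own worked illustration with a 3-bit vector, using the text's naming of `one` for the identity and `zero` for the absorbing element.

```python
# Enumerating the three element sets described above for a tiny 3-bit FDL.

WIDTH = 3
ONE, ZERO = 0, (1 << WIDTH) - 1       # identity and absorbing element
ALL = range(1 << WIDTH)

def leq(a, b):                        # a <= b in the join-semilattice order
    return (a | b) == b

x = 0b011                             # some observed state

observed   = [e for e in ALL if leq(e, x)]        # [one, x]
augmenting = [e for e in ALL if not leq(e, x)]    # ~[one, x]
future     = [s for s in ALL if leq(x, s)]        # [x, zero]

print(observed)    # [0, 1, 2, 3]
print(augmenting)  # [4, 5, 6, 7]
print(future)      # [3, 7]
```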

As a metaphor, it's like painting on white paper with a black pen. If the paper is all white, we can draw arbitrarily on it. After some drawing, the figures that we can draw on it become limited by its history. If we never stop, we'll get a black paper eventually. For a Boolean variable, this paper can be filled by a single stroke. For an FDL with finite capacity, this paper can be filled by finitely many non-trivial strokes. As we don't have FDLs with infinite capacity, we don't paint on unbounded canvases. As a result, I would suggest buying an eraser or a white pen, but there is always a cost for doing so. Anyway, you can prepare more pages.

Now I'm more certain that all FDLs are generalizations of Boolean as a lattice. The useful ranges of Boolean can be [0,0] and [1,1], which are both trivial, while [0,1] gives no information at all[A4]. Under the Bayesian interpretation, a probability is simply a real number between 0 and 1, or a linear extension of logic to Jaynes. The numeric range of a probability statement can give non-trivial information. Non-linear FDLs can give non-linear ranges, which can be more informative.
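
A hedged sketch of the range reading above: Boolean OR is the smallest such lattice, and intersecting probability bounds is one possible way to make "ranges give non-trivial information" concrete. The `refine` function is my own illustration; the text does not define this structure.

```python
# Boolean as the smallest FDL under OR: False is the identity, True absorbing.
assert (False or False) is False   # [0,0]: determined and trivial
assert (True or False) is True     # True absorbs any further evidence
assert (True or True) is True

# Illustrative only: intersecting probability bounds narrows the admissible
# range, so a non-trivial range carries information that [0,1] does not.
def refine(r1, r2):
    return (max(r1[0], r2[0]), min(r1[1], r2[1]))

print(refine((0.0, 1.0), (0.2, 0.9)))   # (0.2, 0.9): [0,1] was uninformative
print(refine((0.2, 0.9), (0.1, 0.6)))   # (0.2, 0.6): a tighter range
```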

Finally, I can explain why I pretend that FDLs are only semilattices. For starters, let's begin with a Hash Set of logic statements only, such as {Blue is a color=true, Microsoft is evil=true, Google is awesome=true}. The first thing I should point out is that we cannot even represent negative evidence in this data structure, such as [Blue is a color=false, Microsoft is evil=false, Google is awesome=false], because any combination of those statements is exactly trivial evidence, or represented by the identity element. As a patch, we can add statements such as {Blue is just a word, Microsoft is not evil, Google is boring}, but then we have to live with contradictory statements. This hypothetical Hash Set is rather innocent, because it does preserve all available evidence. We can do non-trivial analysis with FDLs' other algebraic properties as lattices, but we had better ignore them if we want to deploy FDLs as distributed data processors with fidelity. Nonetheless, back to my hypothetical dilemma: can we handle both negative evidence and positive evidence with certain FDLs? The short answer is yes, since I just did it, but the long answer would be rather a mouthful. FDLs do not automatically resolve logical contradictions, and they don't do statistical inference at all[A5]. Algorithms on FDLs do all the heavy lifting, while their domains of validity are rather limited and conditional. For example, during the processing phase, HyperLogLog counters[A6] are just monotonic integer vectors. Equipped with a pencil and a logarithmic scale, an old-school mathematician might write down the readings of such a counter, and could do the same estimation with the same error distribution. However, should you find negative values in those vectors, you'd better think twice before blaming either cosmic rays or the responsible programmers.
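
The "monotonic integer vector" view of a HyperLogLog counter can be sketched in a few lines: the merge is an elementwise max, which is itself a join-semilattice operation. This is a register-merge sketch only, not a full HyperLogLog implementation, and the register values are made up for illustration.

```python
# During processing, a HyperLogLog counter is just a vector of registers
# merged by elementwise max -- a join-semilattice operation.

def merge(regs_a, regs_b):
    return [max(a, b) for a, b in zip(regs_a, regs_b)]

node1 = [0, 3, 1, 2]
node2 = [2, 1, 1, 5]

merged = merge(node1, node2)
print(merged)                     # [2, 3, 1, 5]

# Monotonic: registers never decrease, and merging is idempotent.
assert merge(merged, node1) == merged
assert all(m >= a for m, a in zip(merged, node1))
assert all(v >= 0 for v in merged)   # a negative register signals corruption
```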

[A1] Information about no evidence is not always trivial.

[A2] About their supremum, which is intuitive tautology.

[A3] Or we can declare that such an instance contains all evidence possible, which is technically true.

[A4] My manager used to answer questions with yes and no, which were bureaucratically correct.

[A5] They are just abstract algebraic structures, just like the almighty Boolean logic.

[A6] HyperLogLog

B. Algebraic Symmetry

Because of the symmetry between join-semilattices and meet-semilattices, I will not discuss meet-semilattices separately. As a thought experiment, we can operate a reversed Bloom Filter starting from all 1s by erasing bits (i.e. excluding elements), to tell whether an element is still available (i.e. not excluded yet). This reversed Bloom Filter has the opposite behavior for membership estimation: it has no false positives while having false negatives. The minimum of this reversed Bloom Filter will report that all elements are possibly not in the cache, which is false negative all the way. Once upon a time, we wanted to devise a probabilistic cache with this property to accelerate lossless cardinal counting[B1], but I failed to see the obvious duality then.
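
The thought experiment above can be sketched as follows. The bit count, hash count, and hashing scheme here are my own arbitrary choices, not from the text.

```python
# A sketch of the reversed Bloom Filter described above: start from all 1s
# and erase bits to exclude elements.
import hashlib

M, K = 64, 3   # bit count and hash count, chosen arbitrarily for the sketch

def positions(item: str):
    for i in range(K):
        h = hashlib.sha256(f"{i}:{item}".encode()).digest()
        yield int.from_bytes(h[:8], "big") % M

bits = [1] * M                      # the starting state: nothing excluded yet

def exclude(item: str):
    for p in positions(item):
        bits[p] = 0                 # bits only go down: monotone erasure

def maybe_available(item: str) -> bool:
    # All bits still 1 -> the item was definitely never excluded
    # (no false positives); a 0 bit may be a collision (false negatives).
    return all(bits[p] for p in positions(item))

exclude("apple")
assert not maybe_available("apple")   # excluded items never report available
```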

By pretending that FDLs are semilattices, Lemma 1 can be stated without dealing with covariance. As a matter of fact, we can actually have two possible cascaded operations, one covariant and one contravariant, with zero and identity exchanged.
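
A hedged sketch of the "zero and identity exchanged" remark: complementing a fixed-width bit vector exchanges join (OR) with meet (AND) and swaps the identity with the absorbing element, by De Morgan's laws. This is my own illustration of the duality, not the text's Lemma 1.

```python
# Complementation as the duality between the two semilattice views.

WIDTH = 4
MASK = (1 << WIDTH) - 1

def comp(x):        # complement within the fixed width
    return ~x & MASK

a, b = 0b0011, 0b0101

# Join in one lattice is meet in its dual (De Morgan).
assert comp(a | b) == comp(a) & comp(b)
assert comp(a & b) == comp(a) | comp(b)

# Identity and absorbing element trade places under the dual operation.
assert comp(0) == MASK and comp(MASK) == 0
```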

[B1] False negatives indicate that we have to read the master database excessively, while false positives lead to missed counts. Like all professional programmers, we resorted to partial caches with LRU or TTL settings.

C. Application of FDLs

From a distributed-computing perspective, all FDLs have excellent properties. I will name a few:

1. Locality and scalability: values of FDLs can be aggregated from distributed nodes with fidelity, which is common for most monoids.

2. Determinism and insensitivity to order: values of FDLs can be aggregated in any order, while most other data structures demand at least implicit sorting.

3. Auto-deduplication: values of FDLs can be aggregated from overlapping data partitions with fidelity, which is quite unusual and convenient for fault tolerance.
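
The three properties above can be demonstrated with the simplest FDL-like structure, a grow-only set under union. The partition contents are an illustrative example of my own.

```python
# Order-insensitive, deduplicating aggregation over overlapping partitions.
from functools import reduce
from random import shuffle

partitions = [{1, 2, 3}, {3, 4}, {4, 5, 1}]   # overlapping on purpose

def aggregate(parts):
    return reduce(set.union, parts, set())

expected = {1, 2, 3, 4, 5}

# Insensitive to order: any permutation yields the same value.
for _ in range(5):
    shuffle(partitions)
    assert aggregate(partitions) == expected

# Auto-deduplication: overlapping (even duplicated) partitions are harmless.
assert aggregate(partitions + partitions) == expected
```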

Because of those properties, FDLs are great templates for large-scale data processing, especially for MPP. There is no need for either orthogonal partitioning or centralized synchronization. When the update operations of FDLs are cheap and in-memory, they are also applicable to real-time streaming, or so-called Complex Event Processing.

For most FDLs with fixed storage and data layout, all operations on them should be linear in the size of the data structure. More specifically, if the elements are sparse, the cost of operations can be linear in their actual size, so that the operations on those FDLs are actually linear in data I/O. It's not a secret that programmers of large distributed systems love Bloom Filters and HyperLogLog counters[C1].
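
The sparsity remark can be made concrete: storing only non-zero registers in a dict makes the merge cost linear in the number of stored entries rather than in the nominal width of the structure. This is an illustration of my own, not from the text.

```python
# Sparse register-vector merge: cost is O(stored entries), not O(width).

def sparse_merge(a: dict, b: dict) -> dict:
    """Elementwise max of two sparse non-negative register vectors."""
    out = dict(a)
    for idx, val in b.items():      # touches only entries actually stored
        out[idx] = max(out.get(idx, 0), val)
    return out

a = {2: 5, 100: 1}    # nominal width could be millions; we store 2 entries
b = {2: 3, 7: 4}

print(sparse_merge(a, b))   # {2: 5, 100: 1, 7: 4}
```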

IMHO, an instance of an FDL is an information marker which can be monotonically refined by data under the primary operation. There is no information loss if and only if we just want to retain the upper bound in the corresponding semilattice. In terms of distributed computing, such an instance can be easily implemented with good scalability. Whenever applicable[C2], FDLs are almost optimal in terms of computational efficiency for such applications.

[C1] It's not so obvious that most LogLog variants are FDLs, and I'll discuss this later.

[C2] We can always find counterexamples, such as simple distributed counters for non-cardinal values.

