Kolmogorov Axiomatics of Probability Theory
Events as sets and probability as a measure on a family of sets
The Kolmogorov approach represents random events by subsets of a fundamental set Ω, known as the sample space. The sample space encompasses all possible outcomes of a given experiment; individual outcomes are referred to as elementary events.
To represent random events effectively, the collection of subsets must be rich enough to support the set-theoretic operations of intersection, union, and difference, while also being limited enough to exclude "events" that cannot be reasonably interpreted. An overly extensive system of subsets may lead to ambiguities that hinder meaningful analysis, as discussed in Chapter 3 in connection with verification.
After selecting a proper system of sets to represent events, we assign weights to these subsets:

A ↦ P(A). (1.1)
The probabilistic weights are chosen to be nonnegative real numbers and normalized by 1: P(Ω) = 1, i.e., the probability that anything at all happens equals one.
When a coin is tossed n times, each outcome can be represented as a vector ω = (x1, ..., xn), where each xj is either H (heads) or T (tails). This experiment generates a sample space containing 2^n possible outcomes. These outcomes reflect the observable results, the sides of the coin, rather than any hidden parameters of the coin or of the person tossing it. We shall explore this point further later.
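For small n the sample space can be enumerated explicitly. The following Python sketch (the function name `coin_sample_space` is ours, purely illustrative) lists all 2^n outcomes:

```python
from itertools import product

def coin_sample_space(n):
    """All 2**n outcomes of tossing a coin n times, as tuples of 'H'/'T'."""
    return list(product("HT", repeat=n))

omega = coin_sample_space(3)  # 8 elementary events for n = 3
```

Each tuple plays the role of one elementary event ω = (x1, ..., xn).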
Interpretation: An event with large weight is more probable (it occurs more often) than an event with small weight.
In the context of probabilistic weights, a key feature is additivity, which states that the weight of an event A, represented as the disjoint union of events A1 and A2, is equal to the sum of the weights of these individual events.
P(A1 ∪ A2) = P(A1) + P(A2), A1 ∩ A2 = ∅. (1.2)

There is an evident similarity with the properties of mass, area, and volume.
It is useful to impose some restrictions on the system of sets representing events:
• (a1) the set Ω, containing all possible outcomes, and the empty set ∅ are also events (something happens and nothing happens);
• (a2) the union of two sets representing events represents an event;
• (a3) the intersection of two sets representing events represents an event;
• (a4) the complement of a set representing an event, i.e., the collection of all points that do not belong to this set, again represents an event.
Definition 1. A set-system with properties (a1)-(a4) is called an algebra of sets (in the American literature, a field of sets).
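Conditions (a1)-(a4) are directly checkable for a finite set-system. A small Python sketch (the names `is_algebra` and `F` are our own) verifies closedness under union, intersection, and complement:

```python
def is_algebra(omega, family):
    """Check conditions (a1)-(a4) for a family of subsets of omega."""
    fam = {frozenset(s) for s in family}
    omega = frozenset(omega)
    if omega not in fam or frozenset() not in fam:      # (a1)
        return False
    for a in fam:
        if omega - a not in fam:                        # (a4): complement
            return False
        for b in fam:
            if a | b not in fam or a & b not in fam:    # (a2), (a3)
                return False
    return True

F = [set(), {1, 2}, {3}, {1, 2, 3}]   # an algebra on a three-point Ω
```

The family F passes the check, while a family missing a complement (e.g., containing {1} but not {2, 3}) fails it.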
Set-theoretic operations mirror the fundamental operations of classical Boolean logic, where ¬ represents negation (NOT), ∧ conjunction (AND), and ∨ disjunction (OR). In the set-theoretic representation, propositions (events) are mapped onto sets in a way that preserves the logical structure; the associated set-theoretic operations are the complement, the intersection (∩), and the union (∪).
In the early development of probability theory, this foundational map was established within an algebraic framework matching the structure of logic, Boolean algebra, pioneered by G. Boole, the originator of Boolean logic.
8. We recall that Ā = {ω ∈ Ω : ω ∉ A}. It is also convenient to use the operation of the difference of two sets: A \ B = {ω ∈ A : ω ∉ B}, i.e., A \ B = A ∩ B̄.
George Boole developed a mathematical model for the laws of thought, leading to the creation of both Boolean logic and (Boolean) probability theory; he believed that the laws governing human thought and probability are intricately linked. Students should therefore understand that "classical probability," in its set-theoretic framework, is fundamentally rooted in classical (Boolean) logic. Any departure from classical probability may entail a departure from classical logic, and vice versa.
In probabilistic considerations an important role is played by De Morgan’s laws:
• The negation of a conjunction is the disjunction of the negations.
• The negation of a disjunction is the conjunction of the negations.
The rules can be expressed in formal language with two propositions P and Q as:

¬(P ∧ Q) ⟺ (¬P) ∨ (¬Q), ¬(P ∨ Q) ⟺ (¬P) ∧ (¬Q). (1.3)
In the set-theoretic representation, De Morgan’s laws have the form:

Ω \ (A ∪ B) = (Ω \ A) ∩ (Ω \ B), Ω \ (A ∩ B) = (Ω \ A) ∪ (Ω \ B). (1.4)
De Morgan's laws show that, in the definition of a set-algebra, it suffices to require either condition (a2) or condition (a3): given complements, the union and the intersection of sets can each be defined through the other.
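De Morgan's laws are easy to verify on concrete finite sets; in the sketch below (the sets A, B inside Ω = {0, ..., 9} are our illustrative choices) the complement is computed as Ω \ A:

```python
omega = frozenset(range(10))
A = frozenset({1, 2, 3, 4})
B = frozenset({3, 4, 5, 6})

# complement of a union = intersection of the complements
lhs_union = omega - (A | B)
rhs_union = (omega - A) & (omega - B)
# complement of an intersection = union of the complements
lhs_inter = omega - (A & B)
rhs_inter = (omega - A) | (omega - B)
```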
Definition 2. Let F be an algebra of sets. An additive map μ: F → [0, +∞), i.e., such that μ(A1 ∪ A2) = μ(A1) + μ(A2) for disjoint A1, A2 ∈ F, is called a measure.
The role of countable-additivity (σ-additivity)
For a finite sample space Ω, the map defined by (1.1) gives the basic example of Kolmogorov’s measure-theoretic probability. This model is sufficient for a wide range of applications, since Ω can contain billions of points, Ω = {ω1, ..., ωN}. To define a probability of the form (1.1), it is enough to assign to each point ω ∈ Ω its weight

P({ωj}) ≥ 0, ∑_j P({ωj}) = 1.

Then, by additivity, this map is extended to the set-algebra consisting of all subsets of Ω:

P(A) = ∑_{ωj ∈ A} P({ωj}).
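For a finite Ω, the extension of point weights to all subsets by additivity can be sketched as follows (the weight dictionary and the name `make_measure` are illustrative):

```python
def make_measure(weights):
    """Extend nonnegative point weights (summing to 1) to all subsets by additivity."""
    assert all(w >= 0 for w in weights.values())
    assert abs(sum(weights.values()) - 1.0) < 1e-12
    def P(event):
        return sum(weights[w] for w in event)
    return P

P = make_measure({"w1": 0.5, "w2": 0.3, "w3": 0.2})   # illustrative weights
```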
If the sample space Ω is countable, i.e., infinite with enumerable points, or continuous, e.g., a segment of the real line R, then simple additivity is not sufficient for the construction of a fruitful mathematical model. The map must be σ-additive, i.e., additive with respect to countable unions of disjoint events:

P(∪_{k=1}^∞ Ak) = ∑_{k=1}^∞ P(Ak), Ai ∩ Aj = ∅, i ≠ j. (1.5)
To work fruitfully with such maps (e.g., to perform integration), one has to impose special restrictions on the system of sets representing events.

10. Here σ is a symbol for “countably”.
We need not just a set-algebra, but a σ-algebra of sets (in the American literature, a σ-field).
Definition 1σ. An algebra of sets such that the generalizations of conditions (a2) and (a3) hold for countable unions and intersections of sets is called a σ-algebra of sets.
In logical terms this means that new events (propositions) can be formed by applying the operations OR and AND countably many times.
We remark that in set theory De Morgan’s laws hold for any system of sets (Ai), i ∈ I, where I is some set of indices (possibly even uncountable).
Therefore in the definition of a σ-algebra we can assume closedness either with respect to countable unions or with respect to countable intersections.
Definition 2σ. Let F be a σ-algebra of sets.

• A σ-additive map μ: F → [0, +∞), i.e., a map satisfying equality (1.5) for sequences of disjoint sets belonging to F, is called a σ-additive measure;

• a probability measure P is a σ-additive measure normalized by 1, P(Ω) = 1.
Thus by definition any probability measure is σ-additive.
Remark 1. Of course, this is a mathematical idealization of the real situation. Kolmogorov pointed out [177–179] that, since in real experiments it is impossible to “produce” infinitely many events, this condition is not experimentally testable. One may prefer to proceed with finitely additive probabilities. However, without σ-additivity it is difficult (although possible) to define the integral with respect to a probability measure (the Lebesgue integral is well defined only for σ-additive measures) and, hence, to define the mathematical expectation, the operation of averaging. J. L. Doob emphasized that finitely but not countably additive set functions are relevant in certain real-world contexts, a significant caveat for the measure-theoretic approach to probability. The same issue arises in other probability models, in particular in von Mises’ frequency approach, which we shall explore later.
One of the most important “continuous probability models” is based on the sample space Ω = R, i.e., elementary events are represented by real numbers; see section 3.1 for details.
Probability space
Let Ω be any set. Consider a σ-algebra F of its subsets and a probability measure P on F. The Kolmogorov axiomatics (1933) [177] of probability theory can be presented in the following compact form: a probability space is a triple

P = (Ω, F, P).

Points ω of Ω are said to be elementary events, elements of F are random events, P is probability.
Remark 2. (Elementary events and random events) The terminology invented by Kolmogorov is a bit misleading. In fact, one has to distinguish between elementary and random events. The crucial point is that, in general, a single-point set A_ω = {ω}, where ω is one of the points of Ω, need not belong to the σ-algebra of events F. In this case we cannot assign a probability value to A_ω. Thus, some elementary events are, so to say, hidden; although they are present mathematically at the set-theoretic level, we cannot assign probability values to them. One can consider the presence of such hidden elementary events as a classical analog of hidden variables in QM, although the analogy is not complete.
Example 1. Let us consider the sample space Ω = {ω1, ω2, ω3} and let the collection (algebra) of events F consist of the following subsets of Ω: {∅, Ω, A = {ω1, ω2}, B = {ω3}}, with P(A) = P(B) = 1/2. Here the elementary single-point events {ω1}, {ω2} are not “physically approachable”.
(Of course, this is an indirect analogy.) Interpretations of the existence of such “hidden elementary events” received limited attention in classical probability theory. Nevertheless, there are instances in which a probability space P makes it fundamentally impossible to assign probabilities, even zero, to certain events. We shall revisit this issue when examining the classical probabilistic representation of observables.
Elementary Properties of Probability Measure
Consequences of finite-additivity
In this section we discuss the basic consequences of finite-additivity of probability; normalization by 1 is not used in any formula except (1.12). The important property of monotonicity, a consequence of σ-additivity, will be discussed in section 1.3.3; see also Chapter 3 for deeper consequences of σ-additivity.
Finite-additivity implies the following properties:
P(∅) = 0, (1.6)

i.e., probability zero is assigned to the event of absence of any elementary event.

Proof. Since ∅ = ∅ ∪ ∅ and ∅ ∩ ∅ = ∅, we have P(∅) = P(∅ ∪ ∅) = P(∅) + P(∅), and hence (1.6) takes place.
For any pair of events A, B ∈ F,

P(A ∪ B) = P(A) + P(B) − P(A ∩ B). (1.7)

This property generalizes the additivity formula for disjoint events.

Proof. We use the decompositions A ∪ B = (A \ B) ∪ (B \ A) ∪ (A ∩ B), A = (A \ B) ∪ (A ∩ B), B = (B \ A) ∪ (A ∩ B). Hence the LHS of (1.7) can be represented as P(A \ B) + P(B \ A) + P(A ∩ B), and the RHS as [P(A \ B) + P(A ∩ B)] + [P(B \ A) + P(A ∩ B)] − P(A ∩ B), which coincides with the LHS.
For any event B ∈ F, its complement B̄ = Ω \ B also belongs to F, and for any event A ∈ F,

P(A) = P(A ∩ B) + P(A ∩ B̄). (1.8)

This trivial representation of the probability P(A) has very important implications; for example, the formula of total probability and Bell’s inequality.

Proof. Here we use the representation A = (A ∩ B) ∪ (A ∩ B̄), in which the sets on the right are disjoint.
Additivity of probability in combination with its non-negativity implies that, for any pair of events A, B ∈ F with A ⊂ B,

P(A) ≤ P(B). (1.9)

Proof. Since B = A ∪ (B \ A), P(B) = P(A) + P(B \ A). Non-negativity of probability implies (1.9).
From (1.7) we immediately obtain that

P(A1 ∪ ... ∪ An) ≤ P(A1) + ... + P(An), (1.11)

where Ai ∈ F. Since P is not only finitely additive but σ-additive, inequality (1.11) holds for any countable union of events.
Finally, we remark that, since P(Ω) = 1, we have

P(Ā) = 1 − P(A). (1.12)

Exercise 1. Show that, for A, B ∈ F, the following inequality holds:

|P(A) − P(B)| ≤ P(A∆B), (1.13)

where the symmetric difference of two events is defined as A∆B = (A ∪ B) \ (A ∩ B).
Exercise 2. Generalize equality (1.7) to the case of three sets, and then to the case of n sets, n = 2, 3, ....
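The elementary properties above can be checked numerically on a randomly weighted finite sample space; a minimal sketch (all names are ours):

```python
import random

random.seed(0)
omega = list(range(8))
raw = [random.random() for _ in omega]
total = sum(raw)
prob = {w: raw[i] / total for i, w in enumerate(omega)}   # point weights

def P(event):
    return sum(prob[w] for w in event)

A = {0, 1, 2, 3}
B = {2, 3, 4, 5}
lhs = P(A | B)                     # left-hand side of (1.7)
rhs = P(A) + P(B) - P(A & B)       # right-hand side of (1.7)
sym = P(A ^ B)                     # P(A Δ B), cf. (1.13)
```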
We have emphasized the Boolean structure of the system of events, mathematically represented by a σ-algebra, together with the additivity and non-negativity of probability. This set-theoretic framework seems very natural; it may even seem to be given from God. However, it is the creation of human minds, and no one can guarantee that all random experiments are covered by this axiomatics. The properties and consequences of probability derived above impose constraints on the applicability of the Kolmogorov probability model. If for some random experiment these constraints are violated, a new probability model tailored to that experiment has to be designed. This leads to the notion of non-Kolmogorovian probability, as discussed in various references.
R. Feynman was the pioneer in highlighting this conflict between classical probability and quantum physics, as noted in the preface. He emphasized that the probabilistic data obtained from the two-slit experiment contradicts the equality (1.8).
He interpreted quantum interference of probabilities as a perturbation of this equality.
In several of my publications [141], [142], [152], [161], Feynman’s argument was reformulated in terms of conditional probabilities. In this form the equality (1.8) matches the fundamental principle of classical probability known as the formula of total probability (FTP), discussed in section 1.6.
Therefore in the probabilistic terms quantum interference is nothing else but a violation of FTP, see Chapter 5.
Now we derive the most famous constraint for classical probabilities widely known as Bell’s inequality [31].
Bell’s inequality in Wigner’s form
Consider three events A, B, C. It is convenient to use the notations A+ = A, A− = Ā, and similarly for B and C.

Theorem 1. (Bell-Wigner inequality) For any triple of events A, B, C, the following inequality holds:

P(A+ ∩ B+) + P(B− ∩ C+) ≥ P(A+ ∩ C+). (1.14)
Proof. We apply equality (1.8) to each term of (1.14). For the first term, the event A+ ∩ B+ plays the role of A in (1.8) and the event C+ the role of B:

P(A+ ∩ B+) = P(A+ ∩ B+ ∩ C+) + P(A+ ∩ B+ ∩ C−). (1.15)

In the same way we obtain

P(B− ∩ C+) = P(A+ ∩ B− ∩ C+) + P(A− ∩ B− ∩ C+), (1.16)

P(A+ ∩ C+) = P(A+ ∩ B+ ∩ C+) + P(A+ ∩ B− ∩ C+). (1.17)

By adding the first two equalities we come to the expression

P(A+ ∩ B+) + P(B− ∩ C+) = P(A+ ∩ B+ ∩ C+) + P(A+ ∩ B− ∩ C+) + P(A+ ∩ B+ ∩ C−) + P(A− ∩ B− ∩ C+).

Commutativity of the operation of intersection implies that P(A+ ∩ B+) + P(B− ∩ C+) equals P(A+ ∩ C+) plus a nonnegative term. Hence, (1.14) holds.
In deriving inequality (1.14), we relied solely on finite-additivity of probability. In Chapter 8 we shall present another form of Bell-type inequalities and discuss the physical consequences of their violation in experiments with quantum systems.
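The Bell-Wigner inequality (1.14) can be tested by brute force: assign random nonnegative weights to the eight atoms determined by the ± values of three events and compare both sides. A sketch (all names illustrative):

```python
import itertools
import random

random.seed(1)
# random weights on the 8 atoms (a, b, c), with a, b, c in {+1, -1}
atoms = list(itertools.product([+1, -1], repeat=3))
raw = [random.random() for _ in atoms]
total = sum(raw)
p = {atom: w / total for atom, w in zip(atoms, raw)}

def P(pred):
    """Probability of the event selected by the predicate pred."""
    return sum(p[t] for t in atoms if pred(t))

lhs = (P(lambda t: t[0] == +1 and t[1] == +1)
       + P(lambda t: t[1] == -1 and t[2] == +1))
rhs = P(lambda t: t[0] == +1 and t[2] == +1)
# lhs >= rhs for ANY choice of the weights: this is inequality (1.14)
```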
Monotonicity of probability
Theorem 2. For any monotonically decreasing sequence of events A1 ⊃ A2 ⊃ ... ⊃ An ⊃ ... in F,

P(A) = lim_{n→∞} P(An), where A = ∩_{n=1}^∞ An. (1.18)

Proof. We can always assume that A = ∅; otherwise we can operate with the sequence A′n = An \ A. We have
A1 = (A1 \ A2) ∪ (A2 \ A3) ∪ ..., An = (An \ An+1) ∪ (An+1 \ An+2) ∪ ....

Therefore the σ-additivity of P implies that

P(A1) = ∑_{k=1}^∞ P(Ak \ Ak+1), P(An) = ∑_{k=n}^∞ P(Ak \ Ak+1).

The series for P(A1) converges, and the series for P(An) is the remainder of the series for P(A1). The remainder of any convergent series goes to zero.
In the same way σ-additivity implies:
Theorem 3. For any monotonically increasing sequence of events A1 ⊂ A2 ⊂ ... ⊂ An ⊂ ... in F,

P(A) = lim_{n→∞} P(An), where A = ∪_{n=1}^∞ An.
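Theorems 2 and 3 can be illustrated on a countable sample space. Take Ω = {0, 1, 2, ...} with geometric weights P({k}) = 2^{-(k+1)} and the decreasing sequence A_n = {k : k ≥ n}, whose intersection is empty; then P(A_n) = 2^{-n} → 0 = P(∅). A sketch:

```python
def P_tail(n):
    """P(A_n) for A_n = {k : k >= n} under geometric weights P({k}) = 2**-(k + 1)."""
    return 2.0 ** (-n)            # sum over k >= n of 2**-(k + 1) equals 2**-n

probs = [P_tail(n) for n in range(1, 30)]
# A_1 ⊃ A_2 ⊃ ..., the intersection is empty, and P(A_n) → 0 = P(∅)
```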
In contemporary probability theory, the consequences of σ-additivity of a probability measure are typically accepted without hesitation. However, Kolmogorov emphasized that σ-additivity and its implications are theoretical idealizations, to be applied with caution to real-world phenomena. He noted that although σ-additivity cannot be tested experimentally, he selected it as a basis of his theory for reasons of mathematical elegance and simplicity. R. von Mises considered the price of this mathematically convenient formalism too high: admitting countable operations on events leads to "events" that cannot be consistently assigned a probability. His position, outlined in section 3.3.1, placed rigor about what is operationally meaningful above aesthetic simplicity.
Random Variables
In Kolmogorov's classical probability theory, random observations are modeled as random variables, which serve as specific mappings from the space of elementary events to real numbers or other general spaces, often equipped with topological structures.
Maps with values in the real line R (or in Euclidean space R^m) are of special importance, since for them a rich integration theory with respect to probability measures is available, making it possible to determine numerical characteristics such as averages and dispersion. Inspired by quantum mechanics, we shall later study general mathematical models of random measurements (Chapter 5). For the moment we emphasize that classical probability theory represents random observations by maps, and we begin with the simplest class of random variables: discrete random variables.
Definition 3d. A discrete random variable on the Kolmogorov space P is a function a: Ω → Xa, where Xa = {α1, ..., αn, ...} is a countable set (the range of values), such that all the sets

E^a_α = {ω ∈ Ω : a(ω) = α}, α ∈ Xa,

belong to F.
Thus, as stressed above, classical probability theory is characterized by the functional representation of observables.
To each elementary event ω ∈ Ω there is assigned a value of the observable a, i.e., α = a(ω). This is the a-observation on the realization ω of the experiment.
It is typically assumed that the range of values Xa is a subset of the real line We will proceed under this assumption.
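The functional representation of an observable, together with its preimage sets E^a_α, can be sketched on a four-point sample space (the value assignments below are illustrative):

```python
omega = ["w1", "w2", "w3", "w4"]
a = {"w1": 5.0, "w2": 5.0, "w3": -1.0, "w4": 2.0}   # the map ω → a(ω)

def preimage(alpha):
    """E^a_α = {ω ∈ Ω : a(ω) = α}; each such set must belong to F."""
    return {w for w in omega if a[w] == alpha}
```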
Remark 3. (Observations and hidden elementary events) Suppose the system of events F does not contain the single-point set {ω0} for some ω0 ∈ Ω. Then the probability of the outcome ω0 cannot be assigned, yet this realization can occur, and for each observable a the value a(ω0) is well defined. However, we are unable to extract information about this elementary event with the aid of the class of observables corresponding to the chosen probability space.
Consider the probability space of Example 1 and let ω → a(ω) be a random variable, with αj = a(ωj). Suppose that α1 ≠ α2. Then the subsets E^a_{αj} = {ωj}, j = 1, 2, should be measurable, since a is a random variable; but {ωj}, j = 1, 2, do not belong to F. Hence we have to assume that α1 = α2. And this is the general situation: we cannot distinguish such elementary events with the aid of observations.

12. In many applications, e.g., in radio-engineering, telecommunication, and optimization theory, random variables take values in infinite-dimensional spaces, e.g., in Hilbert or Banach spaces. The theory of such random variables is well established; this book does not delve into this topic, although infinite-dimensional random variables, in particular those with values in the space of square-integrable functions L², are relevant. We shall formally introduce Hilbert space in sections 4.1 and 5.2.2, drawing an analogy with the theory of R-valued random variables discussed in this section.
The probability distribution of a (discrete) random variable a is defined as p^a(α) ≡ P(ω ∈ Ω : a(ω) = α). The average (mathematical expectation) of a random variable a is defined as

ā ≡ Ea = α1 p^a(α1) + ... + αn p^a(αn) + .... (1.21)

If the set of values of a is infinite, then the average is well defined if the series in the right-hand side of (1.21) converges absolutely.
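For a finite sample space the distribution p^a and the average (1.21) reduce to finite sums; a sketch with illustrative weights and values:

```python
omega = ["w1", "w2", "w3", "w4"]
prob = {"w1": 0.1, "w2": 0.2, "w3": 0.3, "w4": 0.4}    # illustrative weights
a = {"w1": 5.0, "w2": 5.0, "w3": -1.0, "w4": 2.0}      # the map ω → a(ω)

def dist(x):
    """p^a(α) = P(ω ∈ Ω : a(ω) = α)."""
    return sum(prob[w] for w in omega if a[w] == x)

def expectation():
    """Ea = Σ_α α p^a(α); it equals Σ_ω a(ω) P({ω})."""
    return sum(x * dist(x) for x in set(a.values()))
```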
In cases where the outcomes of random measurements cannot be represented by a finite or countable set, it indicates that the observable in question cannot be described by a discrete random variable.
A random variable on the Kolmogorov space P is a function ξ: Ω → R such that, for any set Γ in the Borel σ-algebra of the real line, its pre-image belongs to the σ-algebra of events F: ξ^{−1}(Γ) = {ω ∈ Ω : ξ(ω) ∈ Γ} ∈ F.
We denote the space of random variables on the Kolmogorov probability space P by R(P); it will be important in formulating the conditions of compatibility of observables, in particular quantum observables (see section 5.5).
In the non-discrete case the mathematical framework is essentially more complicated. The main difficulty is the definition of the integral with respect to a probability measure on F, the Lebesgue integral. In fact, classical probability theory, being based on measure theory, is mathematically more intricate than quantum probability theory, in which integration is replaced by the trace operation for linear operators, and the trace is always a discrete sum. We shall not delve into the theory of Lebesgue integration; instead we proceed formally with the symbol of integration. For a discrete random variable, the integral coincides with the sum defining the mathematical expectation; such a variable is integrable if its mathematical expectation is well defined. Any integrable random variable can be approximated by a sequence of integrable discrete random variables, and its integral is defined as the limit of the integrals of the approximating sequence.
The average (mathematical expectation) of a random variable a is defined as

ā ≡ Ea = ∫_Ω a(ω) dP(ω).
The probability distribution of a random variable a is defined on the Borel subsets of the real line as p^a(Γ) ≡ P(ω ∈ Ω : a(ω) ∈ Γ). This is a probability measure on the Borel σ-algebra, and the calculation of the average can be reduced to integration with respect to p^a:

ā ≡ Ea = ∫_R x dp^a(x).
Conditional Probability; Independence; Repeatability
Kolmogorov’s probability model is based on a probability space equipped with the operation of conditioning. In this model conditional probability is defined by the well-known Bayes’ formula

P(B|C) = P(B ∩ C)/P(C), P(C) > 0. (1.23)

By Kolmogorov’s interpretation it is the probability of an event B to occur under the condition that an event C has occurred. We emphasize that the Bayes formula is fundamentally a definition, not a theorem or an axiom, as stressed by Kolmogorov. (In the frequency approach to probability, in contrast, it is a theorem.)
We remark that the conditional probability (for a fixed conditioning event C) P_C(B) ≡ P(B|C) is again a probability measure on F. For a set C ∈ F, P(C) > 0, and a (discrete) random variable a, the conditional probability distribution is defined as p^a_C(α) ≡ P(a = α|C). We naturally have

p^a_C(α1) + ... + p^a_C(αn) + ... = 1, p^a_C(αn) ≥ 0.

The conditional expectation of a (discrete) random variable a is defined by

E(a|C) = α1 p^a_C(α1) + ... + αn p^a_C(αn) + ....
Again by definition, two events A and B are independent if

P(A ∩ B) = P(A)P(B). (1.24)
In the case of nonzero probabilities P(A), P(B) > 0, independence can be formulated in terms of conditional probability:

P(A|B) = P(A), (1.25)

P(B|A) = P(B). (1.26)
The relation of independence is symmetric: if A is independent of B, i.e., (1.25) holds, then B is independent of A, i.e., (1.26) holds, and vice versa. (We remark that this property does not completely match our intuitive picture of independence.)
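Bayes' formula and the definition of independence can be checked exactly with rational arithmetic; in the sketch below two fair coin tosses give independent events A ("first toss is heads") and B ("second toss is heads"):

```python
from fractions import Fraction

# two fair coin tosses, uniform weights on the four outcomes
omega = [("H", "H"), ("H", "T"), ("T", "H"), ("T", "T")]

def P(event):
    return Fraction(len(event), len(omega))

A = {w for w in omega if w[0] == "H"}   # first toss is heads
B = {w for w in omega if w[1] == "H"}   # second toss is heads

def cond(X, C):
    """Bayes' formula: P(X|C) = P(X ∩ C)/P(C)."""
    return P(X & C) / P(C)
```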
We now discuss an important feature of observables represented by ran- dom variables, namely, repeatability We consider the following problem.
Suppose that some discrete observable a was measured and the value a = α was registered. What is the probability of obtaining the same value a = α in a successive measurement of the same observable a?
In the Kolmogorov measure-theoretic framework we have (for discrete random variables):
P(a = α|a = α) = P(E^a_α|E^a_α) = P(E^a_α ∩ E^a_α)/P(E^a_α) = 1,

where E^a_α = {ω ∈ Ω : a(ω) = α}, α ∈ Xa. Here we used the idempotence of the Boolean operation of conjunction: A ∩ A = A for any set A.
In the same way, for i ≠ j, we have

P(a = αi|a = αj) = P(E^a_{αi} ∩ E^a_{αj})/P(E^a_{αj}) = 0.
Here we used mutual disjointness of the setsE α a representing values of the observablea.
Thus observables represented by random variables have the property of repeatability.
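Repeatability is a direct consequence of idempotence and disjointness of the preimage sets; a small numerical sketch (illustrative weights):

```python
prob = {"w1": 0.25, "w2": 0.25, "w3": 0.5}     # illustrative weights
a = {"w1": 1, "w2": 1, "w3": 2}                # a discrete observable

def P(event):
    return sum(prob[w] for w in event)

def E(alpha):
    """Preimage E^a_α of the value alpha."""
    return {w for w in prob if a[w] == alpha}

repeat_same = P(E(1) & E(1)) / P(E(1))   # = 1, since A ∩ A = A
repeat_diff = P(E(1) & E(2)) / P(E(2))   # = 0, since the preimages are disjoint
```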
For comparison of the repeatability conditions in the Kolmogorov model and in the contextual probability model, it is useful to write them in the compact form

P(a = αi|a = αj) = δij,

where δij is the Kronecker delta.
Repeatability of classical observables represented by random variables is a seemingly self-evident feature of the measure-theoretic framework. However, its validity can be questioned, and violation of repeatability restricts the applicability of the Kolmogorov model to the description of random measurements; compare with the behavior of quantum observables.
Formula of Total Probability
In our further considerations an important role will be played by the formula of total probability (FTP). This is a theorem of the Kolmogorov model.
Let us consider a countable family of disjoint sets Ak belonging to F such that their union is equal to Ω and P(Ak) > 0, k = 1, 2, .... Such a family is called a partition of the space Ω.
Theorem 4. Let {Ak} be a partition. Then, for every set B ∈ F, the following formula of total probability holds:

P(B) = P(A1)P(B|A1) + ... + P(Ak)P(B|Ak) + .... (1.28)

Proof. We have

P(B) = ∑_k P(B ∩ Ak) = ∑_k P(Ak) [P(B ∩ Ak)/P(Ak)] = ∑_k P(Ak)P(B|Ak).
Particularly interesting for us is the case of the partition induced by a discrete random variable a taking values {αk}. Here

Ak = E^a_{αk} = {ω ∈ Ω : a(ω) = αk}.

Let b be another random variable taking values {βj}. For any β ∈ Xb, we have

P(b = β) = P(a = α1)P(b = β|a = α1) + ... + P(a = αk)P(b = β|a = αk) + ....
This is the basic instrument of Bayesian analysis: the probability of obtaining the result β in the b-measurement is estimated on the basis of the conditional probabilities for the outcomes of the a-measurement.
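The FTP (1.28) with the partition induced by a discrete random variable a can be verified directly on a finite space (the weights and value assignments below are illustrative):

```python
prob = {"w1": 0.1, "w2": 0.2, "w3": 0.3, "w4": 0.4}   # illustrative weights
a = {"w1": 0, "w2": 0, "w3": 1, "w4": 1}              # induces the partition
b = {"w1": "x", "w2": "y", "w3": "x", "w4": "y"}      # another observable

def P(event):
    return sum(prob[w] for w in event)

B = {w for w in prob if b[w] == "x"}                  # the event {b = x}
direct = P(B)
ftp = 0.0
for alpha in set(a.values()):
    A_k = {w for w in prob if a[w] == alpha}          # A_k = {a = alpha}
    ftp += P(A_k) * (P(B & A_k) / P(A_k))             # P(A_k) P(B|A_k)
```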
Law of Large Numbers
Theorem 5. (Kolmogorov Strong Law of Large Numbers [175], [177]) Let ξ1, ..., ξN, ... be a sequence of identically distributed and independent random variables with average m = Eξj. Then

lim_{N→∞} (ξ1 + ... + ξN)/N = m almost everywhere (i.e., on a set of probability 1).
This fundamental law, proven by Kolmogorov, guarantees the applicability of the measure-theoretic model to experimental data: the arithmetic average converges to the measure-theoretic average for almost all sequences of trials.
The Kolmogorov strong law of large numbers says that the set of elementary events for which the arithmetic mean converges to the probabilistic mean is large from the measure-theoretic viewpoint. However, for a concrete sequence of experimental trials ω, the law provides no information about whether this convergence actually takes place. This was a key point of von Mises', the founder of the frequency theory of probability, critique of the Kolmogorov measure-theoretic model.
In principle, a tossed coin can generate a long sequence of heads without any tails. During discussions at my lectures in Vienna, Anton Zeilinger emphasized that the same can happen in quantum experiments. From the viewpoint of contemporary measure-theoretic probability his statement is completely justified.
One may, however, be skeptical about this conventional interpretation: any such unusual experimental outcome would most likely be attributed to a defect of the preparation procedure rather than to a mathematically admissible anomaly. This indicates that the frequency interpretation of measure-theoretic probabilities is a delicate issue in the Kolmogorov model.
Consider now some event A, i.e., A ∈ F, and generate a sequence of independent tests in which the event A either happens or not. In the Kolmogorov model this series of experimental trials is described by a sequence of independent random variables: ξj = 1 if A happens in the jth trial and ξj = 0 in the opposite case. The relative frequency of the event A in the first N trials equals

νN(A; ω) = n(A; ω)/N = (ξ1(ω) + ... + ξN(ω))/N,

where n(A; ω) is the number of trials in which A occurred.
Hence, the strong law of large numbers implies, in particular, that probabilities of events are approximated by relative frequencies:

P(A) = lim_{N→∞} νN(A; ω) almost everywhere.
In the Kolmogorov model the frequency interpretation of probability thus holds in the form of a theorem, valid almost everywhere, i.e., up to a set of measure zero. (As some probabilists, in particular von Mises, emphasized, the reference to the strong law of large numbers as justifying the frequency interpretation has to be taken with caution.)

14. This problem is not present in the frequency approach to probability due to R. von Mises, section 1.10.
The strong law of large numbers plays a crucial role in connecting the Kolmogorov model with experimental data. It is a refinement of the law of large numbers, whose original versions go back to Jacob Bernoulli and Poisson, with later contributions by Chebyshev, Markov, Borel, and Cantelli. Its proof in the general measure-theoretic framework was given by Khinchin, and the (weak) law of large numbers is therefore sometimes called the Khinchin law of large numbers.
Theorem 6. (Law of Large Numbers) Let ξ1, ..., ξN, ... be a sequence of identically distributed and independent random variables with average m = Eξj. Then, for any ε > 0,

lim_{N→∞} P(ω ∈ Ω : |(ξ1(ω) + ... + ξN(ω))/N − m| > ε) = 0.
Physicists typically refer to this convergence-in-probability form of the law of large numbers, although in practice they use its strong version. The outputs of Theorems 5 and 6 are, however, essentially different: the latter theorem only tells us that large deviations of the arithmetic average from the measure-theoretic average can occur only with small probability.
In the 18th to early 20th centuries, physicists and mathematicians read the weak law of large numbers as saying that arithmetic averages converge to the mean value. This reading relied on Cournot’s principle for the interpretation of probability. After Kolmogorov proved the strong law of large numbers, reliance on convergence in probability became unnecessary: interpretational principles such as Cournot’s can be applied directly to support the frequency interpretation of probability.
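The law of large numbers is easy to observe in simulation: for Bernoulli trials with m = 1/2, the arithmetic mean over many tosses lies close to 1/2 (the seed and sample sizes below are arbitrary choices):

```python
import random

random.seed(42)                     # arbitrary seed, for reproducibility

def running_average(n):
    """Arithmetic mean of n simulated fair coin tosses (1 = heads)."""
    return sum(random.random() < 0.5 for _ in range(n)) / n

small = running_average(100)
large = running_average(100_000)    # much closer to m = 0.5, typically
```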
Kolmogorov’s Interpretation of Probability
Andrey Kolmogorov proposed the following interpretation of probability: “to an event A which may or may not occur under conditions Σ, a real number P(A) is assigned, which has the following characteristics:

• (a) one can be practically certain that if the complex of conditions Σ is repeated a large number of times, N, then if n is the number of occurrences of event A, the ratio n/N will differ very slightly from P(A);

• (b) if P(A) is very small, one can be practically certain that when conditions Σ are realized only once the event A would not occur.”
Item (a) is the frequency interpretation of probability; it is in the spirit of von Mises’ theory and, in the measure-theoretic framework, it is supported by the law of large numbers (although the notion of “practical certainty” leads to the difficulties discussed above). Kolmogorov stressed in item (b) that frequency approximation is not the only characteristic feature of probability. In accordance with Cournot’s principle, the weight assigned to an event also matters by itself: an event with very small weight practically never happens.
We emphasize that Kolmogorov presented this weight-type argument in its strongest form, “never happen.” One may proceed with a weaker form, “practically never happen” (cf. section 1.13: weak and strong forms of the Cournot principle).
Random Vectors; Existence of Joint Probability
Marginal probability
The probability distribution of the vector formed by any subset of the indices, (a_{i1}, ..., a_{ik}), can be obtained as a marginal of the probability distribution p^a:

p^{a_{i1},...,a_{ik}}(A_{i1}, ..., A_{ik}) = P(ω ∈ Ω : a_{i1}(ω) ∈ A_{i1}, ..., a_{ik}(ω) ∈ A_{ik}).

Consider, for example, a pair of discrete observables a = (a1, a2).
Then

p^{a1}(α1) = ∑_{α2} p^a(α1, α2); p^{a2}(α2) = ∑_{α1} p^a(α1, α2). (1.36)
These conditions can be considered as conditions of marginal consistency.
Consider now a triple of discrete observables a = (a1, a2, a3). Then
\( p_{a_1,a_2}(\alpha_1, \alpha_2) = \sum_{\alpha_3} p_a(\alpha_1, \alpha_2, \alpha_3), \; \ldots, \; p_{a_2,a_3}(\alpha_2, \alpha_3) = \sum_{\alpha_1} p_a(\alpha_1, \alpha_2, \alpha_3). \)
(1.37) One-dimensional distributions can be obtained from the two-dimensional distributions in the same way as in (1.36).
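The marginalization rules (1.36)-(1.37) are easy to make concrete. A minimal sketch (the representation of the joint distribution as a Python dict is an illustrative choice, not the book's notation):

```python
from itertools import product

# joint distribution of a = (a1, a2, a3), each a_i = +1 or -1;
# uniform weights are chosen purely for the sake of the example
p = {outcome: 1 / 8 for outcome in product([+1, -1], repeat=3)}

def marginal(joint, keep):
    """Sum out all coordinates except those at the positions in `keep`."""
    m = {}
    for outcome, prob in joint.items():
        key = tuple(outcome[i] for i in keep)
        m[key] = m.get(key, 0.0) + prob
    return m

p12 = marginal(p, (0, 1))   # two-dimensional marginal p_{a1,a2}, as in (1.37)
p1 = marginal(p, (0,))      # one-dimensional marginal p_{a1}, as in (1.36)

# marginal consistency: p_{a1} computed directly from p agrees with
# p_{a1} computed from the two-dimensional marginal p_{a1,a2}
p1_via_p12 = marginal(p12, (0,))
assert all(abs(p1[k] - p1_via_p12[k]) < 1e-12 for k in p1)
```

The same `marginal` function implements both (1.36) and (1.37); only the set of retained indices changes.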
The fundamental assumption for deriving Bell-type inequalities is the ability to express two-dimensional probability distributions as marginals of a joint probability distribution A more detailed discussion of this concept will be provided in section 1.9.2 and further explored in Chapter 8.
From Boole and Vorob’ev to Bell
Theorem 1 (Bell-Wigner inequality) can be formulated by using random variables.
Theorem 1a. Let \( a_i, i = 1, 2, 3, \) be dichotomous random variables, i.e., \( a_i = \pm 1 \), defined on the same probability space. Then the Bell-Wigner inequality (1.38) holds.
In physics, the inequality is often written in terms of the probability distributions of the random variables:
\( p_{a_1 a_2}(+,+) + p_{a_2 a_3}(-,+) \geq p_{a_1 a_3}(+,+). \) (1.38)
This notation hides the role of the common probability space. To clarify its importance, we can rewrite the inequality (1.38) in complete mathematical notation:
\( P(\omega \in \Omega: a_1(\omega) = +1, a_2(\omega) = +1) + P(\omega \in \Omega: a_2(\omega) = -1, a_3(\omega) = +1) \geq P(\omega \in \Omega: a_1(\omega) = +1, a_3(\omega) = +1). \)
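A quick numerical illustration of Theorem 1a: once three ±1-valued variables are defined on one probability space, no joint distribution can violate (1.38). The sketch below (illustrative code, not from the text) samples random joint distributions on the 8 elementary events and checks the inequality:

```python
import random
from itertools import product

outcomes = list(product([+1, -1], repeat=3))  # 8 elementary events

def random_joint():
    """A random probability distribution over the 8 outcomes."""
    w = [random.random() for _ in outcomes]
    s = sum(w)
    return {o: x / s for o, x in zip(outcomes, w)}

def prob(p, pred):
    """Probability of the event {omega : pred(omega)}."""
    return sum(q for o, q in p.items() if pred(o))

random.seed(0)
for _ in range(1000):
    p = random_joint()
    lhs = (prob(p, lambda o: o[0] == +1 and o[1] == +1)     # p_{a1 a2}(+,+)
           + prob(p, lambda o: o[1] == -1 and o[2] == +1))  # p_{a2 a3}(-,+)
    rhs = prob(p, lambda o: o[0] == +1 and o[2] == +1)      # p_{a1 a3}(+,+)
    assert lhs >= rhs - 1e-12  # (1.38) can never be violated on one space
```

The assertion never fires because \( P(a_1=+, a_3=+) = P(+,+,+) + P(+,-,+) \), and both terms already appear among the events counted on the left-hand side.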
The representation of a family of observables by random variables on a common probability space has a rich history in probability theory. G. Boole initially explored this issue by examining three observables \( a_1, a_2, a_3 = \pm 1 \) together with their pairwise probability distributions \( p_{a_i a_j} \). He asked whether it is possible to construct a discrete probability measure \( p(\alpha_1, \alpha_2, \alpha_3) \) from which all \( p_{a_i a_j} \) can be derived as marginal distributions. Boole demonstrated that the inequality now known as the Bell inequality is a necessary condition for the existence of such a probability measure. Consequently, it may be historically accurate to call it the Boole-Bell inequality, although the name Bell has become firmly attached to it in contemporary discourse.
Boole recognized that it is not always obvious that a single probability space can adequately describe any family of observables He understood that when data is only available for pairwise measurements, the inability to jointly measure triples could lead to complications in analysis.
The violation of the Boole-Bell inequality does not imply a breach of classical probability theory, which was originally formulated to address observable events and their associated probabilities The concept of probability distributions related to "hidden variables" and counterfactuals was entirely outside the scope of the early developers of probability theory.
A. N. Kolmogorov pointed out ([178], section 2) that each complex of experimental physical conditions determines its own probability space. In the general setup he did not discuss the possibility of representing data collected for different (possibly incompatible) experimental conditions with the aid of a single probability space, cf. Boole [41], [42]. Therefore we do not know Kolmogorov's opinion on this problem. However, he solved it positively for one very important case: the famous Kolmogorov theorem guaranteeing the existence of the probability distribution for any stochastic process, see section 1.9.4.
In 1962, the Soviet probabilist N. N. Vorob'ev proposed a complete solution to the problem of the existence of a joint probability distribution for an arbitrary family of discrete random variables, with applications to game theory and random optimization. Although he belonged to a leading probabilistic school, his innovative approach, which pointed beyond the Kolmogorov axiomatics, faced significant criticism and was largely overlooked. His work was rediscovered only in 2005 by K. Hess and W. Philipp, who brought it into the Bell debate.
We remark that in the probabilistic community the general tendency was to try to find conditions for the existence of a single probability space.
The Kolmogorov theorem about the existence of a probability space for a stochastic process (Theorem 8, section 1.9.4) is one of the best known of these existence results.
Such "existence theorems" overshadowed works on non-existence, and significant contributions, such as those of G. Boole, were neglected. They were rediscovered around 2000 by Itamar Pitowsky, who applied them in the Bell debate within quantum physics, highlighting their relevance in contemporary discussions.
Now we formulate the following fundamental problem (see Chapters 5, 8): Can quantum probabilistic data be described by the classical (Kolmogorov) probability model?
The Boole-Bell inequality serves as a necessary condition for the existence of a description of the data within a single Kolmogorov probability space: when such a space exists, Theorems 1 and 1a apply and the inequality must be satisfied. However, it is well established, both theoretically and experimentally, that this inequality is violated for quantum probabilistic data (see Chapter 8). Therefore, in general, quantum data cannot be embedded in a single Kolmogorov probability space.
No-signaling in quantum physics
The (Boole-)Bell inequality serves as a statistical test of the Kolmogorovness of data. We now consider another statistical test pertinent to Bell-type experiments, based on the condition of marginal consistency used in establishing the joint probability distribution for the observables considered in Bell's framework.
The conditions of pairwise marginal consistency, cf. (1.36), are known in discussions of Bell's inequality as the no-signaling conditions. In this context, the random variables a1 and a2 denote observables measured in two spatially separated laboratories.
In the light of our previous analysis this conclusion seems to be totally justified.
The distance between the laboratories is large enough that a signal about the measurement of a2 in lab 2, propagating with the velocity of light, cannot reach lab 1 before a1 is measured. Consequently, the probability distribution of a1 should not depend on the experimenter's choice of which random variable is measured in lab 2.
If a different random variable, say \( a'_2 \), is selected in lab 2, the marginal of the corresponding joint probability distribution must coincide with the existing probability distribution of \( a_1 \). Therefore, in addition to the previous equalities, we have \( p_{a_1}(\alpha_1) = \sum_{\alpha'_2} p_{a_1, a'_2}(\alpha_1, \alpha'_2) \) and \( p_{a'_2}(\alpha'_2) = \sum_{\alpha_1} p_{a_1, a'_2}(\alpha_1, \alpha'_2) \).
We remark [156] (see also Chapter 5) that a violation of the conditions of marginal consistency for pairs of observables is equivalent to a violation of FTP (or additivity of probability). Suppose that the equalities (1.41), (1.42) are violated, i.e., \( q_{a_1}(\alpha_j) = q'_{a_1}(\alpha_j) \), j = 1, 2, do not hold, where the q's denote the corresponding sums of joint probabilities. Then at least one of these sums, say \( q_{a_1}(\alpha_j) \), must differ from the probability distribution \( p_{a_1}(\alpha_j) \). This discrepancy can be quantified as \( \delta_j \equiv \delta_{a_1|a_2}(\alpha_j) = p_{a_1}(\alpha_j) - \sum_{\alpha_2} p_{a_1,a_2}(\alpha_j, \alpha_2) \neq 0. \)
To couple this discrepancy to a violation of FTP, we, as usual, rewrite the sum of joint probabilities as \( \sum_{\alpha_2} p_{a_1,a_2}(\alpha_j, \alpha_2) = \sum_{\alpha_2} p_{a_2}(\alpha_2)\, p_{a_1|a_2}(\alpha_j|\alpha_2). \)
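The signaling coefficients δ_j can be computed directly from two empirical joint-probability tables. A sketch with hypothetical numbers (the tables below are invented for illustration; they do not come from any actual experiment):

```python
def marginal_first(joint):
    """Marginal distribution of a1 from a table {(alpha1, alpha2): prob}."""
    m = {}
    for (a1, _), prob in joint.items():
        m[a1] = m.get(a1, 0.0) + prob
    return m

# hypothetical joint tables for the runs (a1, a2) and (a1, a2');
# the numbers are chosen purely for illustration
p_a1_a2 = {(+1, +1): 0.40, (+1, -1): 0.10, (-1, +1): 0.10, (-1, -1): 0.40}
p_a1_b2 = {(+1, +1): 0.30, (+1, -1): 0.10, (-1, +1): 0.20, (-1, -1): 0.40}

m = marginal_first(p_a1_a2)        # p_{a1} inferred from the (a1, a2) runs
m_prime = marginal_first(p_a1_b2)  # p_{a1} inferred from the (a1, a2') runs

# signaling coefficients delta_j: nonzero values witness marginal inconsistency
delta = {a: round(m[a] - m_prime[a], 12) for a in m}
print(delta)  # {1: 0.1, -1: -0.1}: the data cannot share one probability space
```

A real analysis would, of course, work with observed frequencies and ask whether such nonzero δ_j are statistically significant, which is exactly the issue discussed below.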
Quantum mechanics (QM) is consistent with the no-signaling principle, as shown by direct calculation of the joint probability distributions of compatible observables represented by commuting operators. In experiments testing violation of Bell's inequality, it is therefore crucial to verify not only the presence of a violation but also that any signaling is statistically insignificant.
For statisticians, the term "signaling" can be misleading: what is actually examined is marginal consistency. This crucial check was often omitted in experiments testing Bell's inequality. Notably, our analysis, performed in collaboration with G. Adenier, who carried out the bulk of the data processing, revealed that the data from the landmark experiments of A. Aspect and G. Weihs exhibit statistically significant signaling; Weihs later offered an explanation of this statistical phenomenon.
We highlighted the issue of signaling in the quantum community and devoted significant effort to obtaining new data for analysis. In a preprint I expressed my concern about the common neglect of the statistical analysis of signaling: it is perplexing that researchers can celebrate violations of Bell's inequality while ignoring signaling in the same data.
Recent experiments have examined the no-signaling hypothesis, with findings suggesting that it cannot be rejected with high statistical significance. This book focuses on foundational issues of probability and randomness and deliberately excludes statistical questions, which pose significant foundational challenges of their own.
The issue of statistically significant rejection of signaling is complex, primarily because of the magnitude of the coefficients δj whose significance has to be assessed. I contend that in real experiments these coefficients are strictly positive. Therefore one has to fix a threshold ε and ask whether the experimental data supports rejection of the hypothesis H0 = {δj ≥ ε}. For me, selecting a threshold that accurately reflects the experimental conditions responsible for signaling is itself a significant challenge.
E. Dzhafarov has recently proposed a method to treat violations of no-signaling and of Bell-type inequalities jointly, aiming to test the hypothesis that the latter violations do not stem from signaling. He introduced this approach within a novel framework.
Although experimenters claimed that their data was free of signaling and would be shared, we could not obtain it despite repeated requests. Other researchers indicated that signaling is an inherent feature of the collected data, which led to considerable frustration on my part. Consequently, I published an arXiv preprint stating that experiments without open-access data should be considered inconclusive. To my surprise, after this publication leading experimental groups began to promise release of their data once their papers were published. They may have taken my statement about the validity of Weihs' experiment seriously, though the timing may also be coincidental. Notably, before my preprint, Weihs' experiment was the only Bell-type experiment with accessible data, and even that data later became unavailable.
Dzhafarov is a psychologist. He actively worked on the problem known as marginal inequality, modifying Bell's inequality; see, e.g., the works of Dzhafarov and coauthors.
This approach parallels my own research on the relationship between violations of Kolmogorovness and violations of Bell's inequality, a line of work largely overlooked by the quantum community, as noted in several of my publications. In Theorem 1 (1a) the existence of a single probability space P = (Ω, F, P) serving all pairs of observables is essential. Building on my previous works, I consider instead a family of probability spaces in which only the probability measures differ: instead of a single P, there is a family (Pu). In probability theory various distances between probability measures are studied, see, e.g., [233]. Take one of them, ρ. Then γ = sup_{u,v} ρ(Pu, Pv) can be used as a measure of non-Kolmogorovness.
Modified Bell inequalities were derived, with deformations quantified by γ. In real experiments one has to test these deformed inequalities rather than the original Bell inequality, because it is impossible to guarantee the same probability measure across all pairs of measurements. One witness of this impossibility is signaling, which reflects non-Kolmogorovness, as discussed in Chapter 5.
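The measure γ = sup_{u,v} ρ(Pu, Pv) is straightforward to compute for a finite family of discrete measures. A sketch using the total variation distance as ρ (one of the standard choices; the family of measures below is purely illustrative):

```python
from itertools import combinations

def total_variation(p, q):
    """Total variation distance between two discrete probability measures."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# a family (P_u) of measures on one outcome set, indexed by the
# experimental context u (contexts and numbers invented for illustration)
family = {
    "u1": {"+": 0.50, "-": 0.50},
    "u2": {"+": 0.45, "-": 0.55},
    "u3": {"+": 0.60, "-": 0.40},
}

gamma = round(max(total_variation(family[u], family[v])
                  for u, v in combinations(family, 2)), 12)
print(gamma)  # 0.15: the largest pairwise discrepancy within the family
```

γ = 0 recovers the Kolmogorovian case of a single measure; the larger γ, the stronger the contextual deformation that the modified inequalities must absorb.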
Kolmogorov theorem about existence of stochastic processes
The notion of a random vector generalizes to the notion of a stochastic process. Suppose that the set of indices is infinite, for example, \( a_t, t \in [0, +\infty) \). Suppose that, for each finite set \( (t_1, \ldots, t_k) \), the vector \( (a_{t_1}, \ldots, a_{t_k}) \) can be observed and its probability distribution \( p_{t_1 \ldots t_k} \) is given.
In the early 20th century a key question of probability theory was whether it is possible to construct a single probability space P = (Ω, F, P) such that all the \( a_t \) can be represented as random variables on it and all the probability distributions \( p_{t_1 \ldots t_k} \) are induced by this single space, thus creating a coherent framework for the process as a whole.
Kolmogorov established natural conditions for the system of measures that ensure the existence of a probability space This relates to the issue of marginal (in)consistency, which can also be understood as (no-)signaling in quantum physics.
Theorem 7. Let a family of probability distributions \( (p_{t_1 \ldots t_k}) \) satisfy the following conditions:
• For any permutation \( s_1, \ldots, s_k \) of the indices \( t_1, \ldots, t_k \),
\( p_{t_1 \ldots t_k}(A_{t_1} \times \cdots \times A_{t_k}) = p_{s_1 \ldots s_k}(A_{s_1} \times \cdots \times A_{s_k}). \) (1.44)
• For two sets of indices \( t_1, \ldots, t_k \) and \( r_1, \ldots, r_m \),
\( p_{t_1 \ldots t_k r_1 \ldots r_m}(A_{t_1} \times \cdots \times A_{t_k} \times R \times \cdots \times R) = p_{t_1 \ldots t_k}(A_{t_1} \times \cdots \times A_{t_k}). \) (1.45)
Then there exists a probability space P = (Ω, F, P) and a stochastic process \( a_t(\omega) \) on it such that the equality (1.43) holds. The conditions (1.44) and (1.45) are also necessary for the existence of such a stochastic process.
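The two consistency conditions (1.44)-(1.45) can be checked mechanically for a concrete family of finite-dimensional distributions. A sketch for an i.i.d. fair-coin process (chosen for illustration; any consistent family would pass the same checks):

```python
from itertools import product, permutations

S = [0, 1]  # state space of the process

def fdd(times):
    """Finite-dimensional distribution p_{t1...tk} of an i.i.d. fair coin."""
    return {o: 0.5 ** len(times) for o in product(S, repeat=len(times))}

def is_consistent(times):
    p = fdd(times)
    # condition (1.44): permuting the indices permutes the arguments
    for perm in permutations(range(len(times))):
        q = fdd([times[i] for i in perm])
        for o in p:
            if abs(p[o] - q[tuple(o[i] for i in perm)]) > 1e-12:
                return False
    # condition (1.45): summing out the last index recovers the
    # distribution for the shorter set of indices
    shorter = fdd(list(times)[:-1])
    for o in shorter:
        if abs(sum(p[o + (s,)] for s in S) - shorter[o]) > 1e-12:
            return False
    return True

print(is_consistent([1, 2, 3]))  # True: the family extends to a process
```

By Theorem 7, a family passing these checks for all finite index sets extends to a stochastic process on a single probability space; the code only verifies the conditions for one index set, of course.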
During the next 80 years analysis of properties of (finite and infinite) families of random variables defined on one fixed probability space was the main activity in probability theory.
In Kolmogorov's construction of the probability space for a stochastic process, the space of elementary events Ω is the collection of all trajectories t → ω(t), and the random variable \( a_t \) is given by the evaluation map \( a_t(\omega) = \omega(t) \).
Construction of the probability measure P serving all finite random vectors is a mathematically sophisticated task, going back to the construction of the Wiener measure on the space of continuous functions.
In exploring the relationship between classical and quantum probabilities, it is essential to recognize that describing an elementary event requires tracking the entire trajectory of the process This suggests that such an event is fundamentally shaped by experimental data and the measurement of the stochastic process at different time points Consequently, the Kolmogorov construction of a stochastic process relies not on “hidden variables,” such as initial conditions that predetermine future measurement outcomes, but rather on the actual results obtained from those measurements.
Frequency (von Mises) Theory of Probability
Von Mises' theory (1919) was the first probability theory [247]-[249] based fundamentally on the principle of the statistical stabilization of frequencies.
From the inception of probabilistic studies, the principle of randomness was used heuristically; it was von Mises who sought to formalize it mathematically and establish it as a fundamental tenet of probability theory. His theory revolves around the concept of a collective, i.e., a random sequence.
In a random experiment S, the set of all possible outcomes L = {α1, …, αm} is called the label set, or set of attributes, of the experiment. We restrict considerations to finite sets L. Perform N trials of the experiment S and record the results xj ∈ L. This yields a finite sample x = (x1, …, xN), xj ∈ L.
A collective is an infinite idealization of this finite sample:
x = (x1, …, xN, …), xj ∈ L, (1.48)
for which the following two von Mises principles are valid.
The first is the principle of the statistical stabilization of the relative frequencies of each attribute α ∈ L of the experiment S in the sequence (1.48). Consider the frequencies \( \nu_N(\alpha; x) = n_N(\alpha; x)/N \), where \( n_N(\alpha; x) \) is the number of appearances of the attribute α among the first N trials.
The principle of statistical stabilization states that, as N tends to infinity, the frequency \( \nu_N(\alpha; x) \) stabilizes and approaches a limit for every label α ∈ L.
The limit \( P_x(\alpha) = \lim_{N \to \infty} \nu_N(\alpha; x) \) is called the probability of the attribute α of the random experiment S.
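The principle of statistical stabilization can be illustrated by simulation: for a pseudo-random binary sequence the frequencies ν_N(α; x) settle near a limit as N grows. A sketch (a finite sample here only approximates von Mises' infinite collective):

```python
import random

random.seed(1)
L = ["H", "T"]                                  # label set of the experiment S
x = [random.choice(L) for _ in range(100_000)]  # finite sample from S

for N in (10, 100, 1_000, 100_000):
    n_N = x[:N].count("H")                      # n_N("H"; x)
    print(N, n_N / N)                           # nu_N("H"; x) settles near 1/2
```

For small N the frequency fluctuates noticeably; for large N it hovers near 1/2, which in von Mises' terms would be the probability P_x("H") of the collective.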
When the collective is fixed, this probability may also be denoted P(α). However, such notation would provoke criticism from von Mises, who argued that using abstract probabilistic symbols detached from concrete random experiments is meaningless and can lead to paradoxical conclusions.
“We will say that a collective is a mass phenomenon or a repetitive event, or simply a long sequence of observations for which there are sufficient reasons to believe that the relative frequency of the observed attribute would tend to a fixed limit if the observations were indefinitely continued.”
R. von Mises viewed probability theory not merely as a mathematical construct but as a physical theory, akin to hydrodynamics. He asserted that the foundation of probability lies in physical experiment, positioning probability theory within the realm of physics rather than pure mathematics.
He was criticized for mixing physics and mathematics. But he answered that there is no way to proceed with probability as a purely mathematical entity (cf. the remark of A. Zeilinger discussed in section 2.4). On this view, as observations of a specific attribute are extended indefinitely, its relative frequency converges to a stable value, and this stable value is defined as the probability of the attribute within the collective under examination.
It is clear that the sequence x = (0, 1, 0, 1, 0, 1, …) cannot be considered the result of a random experiment. Nevertheless, the principle of statistical stabilization holds for it, with probabilities Px(0) = Px(1) = 1/2. To exclude such sequences from probability theory, we must impose an additional constraint on them.
The limits of relative frequencies have to be stable with respect to a place selection (choice of a subsequence) in (1.48).
In particular, x does not satisfy this principle.
The concept of randomness, while seemingly intuitive, posed a significant challenge to the foundations of von Mises' theory. The critical issue was to specify a class of place selections that could support a rigorous theoretical framework. Von Mises formulated a fundamental restriction on place selections.
In (1.48), the selection of a subsequence cannot depend on the attributes (label values) of the elements themselves. For instance, one cannot form a subsequence of (1.48) by selecting the elements carrying a specific label α ∈ L. Von Mises introduced the notion of place selection to address this issue:
A subsequence is obtained by a place selection when the decision to retain or discard the nth element depends on the number n and on the label values x1, …, xn−1 of the preceding elements, and not on the label value of the nth element or of any following element.
A place selection can thus be characterized by a sequence of functions f1, f2(x1), f3(x1, x2), …, where each fn takes the value 0 (reject the nth element) or 1 (retain it).
A. Zeilinger noted that although a sequence may have a deterministic structure, it could in principle arise from an intrinsically random experiment, even if the probability of such an outcome is negligible. This viewpoint matches the Kolmogorov measure-theoretic approach; in the von Mises frequency approach one would instead question the randomness of the experiment if such a sequence were observed. Additionally, an infinite input sequence must yield an infinite output sequence, i.e., fn(x1, …, xn−1) = 1 for infinitely many n.
• choose those xn for which n is prime;
• choose those xn which follow the word 01;
• toss a (different) coin; choose xn if the nth toss yields heads.
The first two selection procedures are lawlike; the third is random. Importantly, all of them are place selections: the value of xn does not influence the decision whether to select xn.
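The first two place selections above are easy to implement, and applying them to the periodic sequence x = (0, 1, 0, 1, …) shows explicitly why that sequence fails the principle of randomness. A sketch (illustrative code):

```python
def select(x, f):
    """Keep the element at 1-based position n when f(n, x[:n-1]) is True."""
    return [x[n - 1] for n in range(1, len(x) + 1) if f(n, x[:n - 1])]

x = [0, 1] * 5000          # the periodic sequence 0, 1, 0, 1, ...

# two lawlike place selections: the decision uses only n and the prefix
every_second = select(x, lambda n, _: n % 2 == 0)
after_01 = select(x, lambda n, p: p[-2:] == [0, 1])  # "follows the word 01"

freq = sum(every_second) / len(every_second)
print(freq)                           # 1.0: the subsequence is constant
print(sum(after_01) / len(after_01))  # 0.0: every element after "01" is 0
```

Since the limiting frequencies in these subsequences (1 and 0) differ from the frequencies 1/2, 1/2 of the whole sequence, x violates the principle of randomness and is not a collective.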
The principle of randomness requires that the limits of the relative frequencies in any subsequence obtained by a place selection coincide with the probabilities in the original collective; no admissible selection strategy can change the odds, say of a fair coin. This is referred to as the law of excluded gambling strategy, an inherent unpredictability of gambling outcomes noted by Feller:
Many gamblers have learned through painful experience that no betting system can improve their chances of winning. The importance of this insight was first recognized by R. von Mises, who made the impossibility of a successful gambling system a core principle of his theory.
Subjective Interpretation of Probability
This section, along with sections 1.12 and 1.13, addresses complex interpretational questions regarding probability; readers may proceed directly to Chapter 2. From the subjectivist perspective, probability P(A) represents the degree of personal belief in the occurrence or non-occurrence of the event A. This concept was most forcefully articulated by de Finetti, a key figure in subjective probability theory:
"Probability does not exist as an objective entity." For readers who plan to study Chapter 6 on interpretations of quantum mechanics (QM), this section serves as useful preliminary training, particularly for the information interpretation of QM and for QBism. Readers who find lengthy interpretational discussions overwhelming may skip ahead.
We have already presented two basic mathematical models of probabil- ity, von Mises’ frequency model and Kolmogorov’s measure-theoretic model.
The von Mises approach played a pivotal role in the development of the theory of individual random sequences, although his model is largely overlooked today. In contrast, Kolmogorov's model is applied extensively across engineering, telecommunications, statistical physics, chemistry, biology, psychology, and social science.
Randomness can first be understood as unpredictability, i.e., the impossibility of a successful gambling system. Mathematical formalization then produced two further significant approaches: randomness as complexity, introduced by Kolmogorov, and randomness as typicality, proposed by Martin-Löf. For a deeper exploration of these concepts, see Chapter 2.
Scientific theories comprise two key components: a mathematical model and an interpretation of its mathematical entities. Both von Mises' and Kolmogorov's models endow probability with a statistical interpretation: repeating the same experimental conditions many times produces outcome frequencies that approximate the corresponding probabilities. A notable distinction lies in the role of this interpretation. For von Mises the statistical interpretation is the very definition of probability, whereas Kolmogorov defines probability as a measure and must prove the strong law of large numbers to justify the frequency interpretation. Moreover, defining probability as a measure can be understood as assigning weights to elementary events.
Kolmogorov's definition of probability develops Laplace's approach of weighting events, going beyond the assignment of equal weights to all elementary outcomes. However, Kolmogorov himself did not insist on the measure-theoretic (weight-type) interpretation; he preferred a frequency interpretation of probability, close to von Mises' "genuine frequency interpretation" and justified by the law of large numbers.
• (a) when a complex of conditions Σ is repeated a large number of times, N, and n is the number of occurrences of the event A, the ratio n/N will differ only slightly from P(A).
In developing his measure-theoretic axiomatization of probability theory, Kolmogorov was significantly influenced by von Mises, and the frequency interpretation may well stem from von Mises' theory. A key criticism of the weight interpretation is that it loses its heuristic appeal for continuous sample spaces, such as Ω = [a, b], Ω = R, or Ω = C([a, b]) (the space of continuous functions on [a, b], endowed with, e.g., the Wiener measure).
For such non-discrete probability measures, each single elementary event carries weight zero, and it is impossible to proceed from the zero probability of elementary events to nonzero probabilities of non-elementary events.
The weight-like perspective on probability nevertheless remains present in the Kolmogorov interpretation, as evidenced by Cournot's principle:
• (b) “if P(A) is very small, one can be practically certain that when conditions Σ are realized only once the eventAwould not occur.”
This (b)-part of Kolmogorov’s interpretation of probability is totally foreign to von Mises’ genuine frequency ideology.
In scientific theories, a single mathematical model can carry several interpretations, as exemplified by Kolmogorov's measure-theoretic model of probability. Besides the widely recognized statistical interpretation, probability measures can also be understood through the lens of subjective probability theory.
This interpretation was used by T. Bayes as the basis of his theory of probability inference; see also Ramsey [225], de Finetti [69], Savage [226].
Bernardo and Smith highlight the distinction between subjective and objective probability: P(A) reflects an agent's personal degree of belief in the occurrence or non-occurrence of the event A. Unlike objective probability, which is grounded in the statistical interpretation and exists independently of individual beliefs, subjective probability does not exist in nature without an agent assigning its value. This perspective conflicts with von Mises' treatment of probability theory as a natural science, akin to hydrodynamics. Kolmogorov and many Soviet probabilists (see section 1.12) also took an actively anti-subjectivist position, although Kolmogorov, unlike von Mises, approached probability theory as a mathematical discipline; he viewed probability as an objective characteristic of statistically stable repeatable phenomena.
The measure-theoretic definition of probability, based on a weighting-type procedure for events, matches the subjective interpretation well. While Kolmogorov intended the weights of events to be objective, subjectivists advocate personal weights: each agent determines her own degree of belief regarding an event. This approach conflicts with the positions of von Mises, Kolmogorov, and their followers, and with the basic methodology of modern science. De Finetti recognized this discrepancy and highlighted it in a thought-provoking essay, citing notable methodological insights of Tilgher:
Truth is not found in an external realm but within the act of thought itself The absolute is accessible through our knowledge rather than hidden in mystery Thought serves as a biological function that helps us navigate life, enrich our experiences, and effectively engage with reality.