
Counting Achievable Labelings: Canonical Shattering Coefficients
Verbally defining shattering coefficients seems straightforward at first glance:
Given a hypothesis class H, its nᵗʰ shattering coefficient, denoted Sₙ(H), represents the greatest number of labelings achievable by classifiers in H on a sample of n feature vectors.
But what’s a “labeling”? And what makes it “achievable”? Answering those questions will help us lay some groundwork in pursuit of a more formal definition.
In the context of binary classification, a labeling of a sample of feature vectors is simply any one of the ways we can assign values from the set { -1, 1 } to those vectors. As a very simple example, consider two one-dimensional feature vectors (i.e., points on a number line), x₁ = 1 and x₂ = 2.
The possible labelings are all the combinations of classification values we can assign to the individual feature vectors, each independently of the other. We can represent each labeling as a vector, where the first and second coordinates represent the values assigned to x₁ and x₂, respectively. The set of possible labelings is thus { (-1, -1), (-1, 1), (1, -1), (1, 1) }. Note that a sample of size 2 yields 2² = 4 possible labelings — we’ll see how this generalizes to arbitrarily-sized samples soon.
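To make this concrete, here is a minimal Python sketch (not from the original article) that enumerates every possible labeling of a sample of n feature vectors; the function name all_labelings is just an illustrative choice.

```python
from itertools import product

def all_labelings(n):
    """Enumerate every possible labeling of a sample of n feature vectors."""
    return list(product([-1, 1], repeat=n))

print(all_labelings(2))
# [(-1, -1), (-1, 1), (1, -1), (1, 1)] -- 2^2 = 4 possible labelings
```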
We say that a labeling is achievable by a hypothesis class H if there exists a classifier h ∈ H from which that labeling can result. Continuing with our simple example, suppose we are limited to classifiers of the form x ≥ k, k ∈ ℝ, that is, one-dimensional thresholds under which anything to the right of the threshold is classified positively. The labeling (1, -1) is not achievable by this hypothesis class: since x₂ is greater than x₁, any threshold that classifies x₁ positively must do the same for x₂. The set of achievable labelings is therefore { (-1, -1), (-1, 1), (1, 1) }.
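As a rough illustration of achievability (again, a sketch rather than anything from the article), we can sweep a few representative thresholds over the sample and collect the labelings they induce; for this hypothesis class, only thresholds falling below, between, or above the two points can produce distinct labelings, so three candidates suffice.

```python
def achievable_labelings(sample, thresholds):
    """Collect the distinct labelings induced on `sample` by classifiers x >= k."""
    return {tuple(1 if x >= k else -1 for x in sample) for k in thresholds}

sample = [1, 2]
candidate_thresholds = [0.5, 1.5, 2.5]  # below both points, between them, above both
print(achievable_labelings(sample, candidate_thresholds))
# {(1, 1), (-1, 1), (-1, -1)} -- (1, -1) is missing, as argued above
```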
Having understood the basic terminology, we can begin to develop some notation to formally express elements of the verbal definition we started with.
We stick with representing labelings as vectors, as we did in our simple example, with each coordinate representing the classification value assigned to the corresponding feature vector. There are 2ⁿ possible labelings in total: there are two choices for each feature vector, and we can think of a labeling as a collection of n such choices, each made independently of the rest. If a hypothesis class H can achieve all possible labelings of a sample 𝒞ₙ, i.e., if the number of achievable labelings of 𝒞ₙ equals 2ⁿ, we say that H shatters 𝒞ₙ.
Finally, using the notation from above, we converge on a more rigorous definition of Sₙ(H): it is the maximum, taken over all samples 𝒞ₙ of size n, of the number of labelings of 𝒞ₙ achievable by classifiers in H.
In line with our explanation of shattering, Sₙ(H) equalling 2ⁿ implies that there exists a sample of size n that’s shattered by H.
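Putting the pieces together, here is a hedged sketch that computes Sₙ(H) by brute force for the one-dimensional threshold class from our example. It relies on a fact specific to thresholds (that any n distinct points yield the same number of achievable labelings, so checking one sample suffices); the function name is, once more, just illustrative.

```python
def shattering_coefficient_thresholds(n):
    """S_n(H) for one-dimensional thresholds x >= k, computed by brute force."""
    sample = list(range(1, n + 1))            # any n distinct points behave the same here
    cuts = [0.5 + i for i in range(n + 1)]    # one threshold per gap, plus both extremes
    labelings = {tuple(1 if x >= k else -1 for x in sample) for k in cuts}
    return len(labelings)

for n in range(1, 6):
    print(n, shattering_coefficient_thresholds(n), 2 ** n)
# S_n(H) = n + 1, so only n = 1 satisfies S_n(H) = 2^n, i.e., only
# samples of size 1 are shattered by one-dimensional thresholds
```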
Estimating Hypothesis Class Expressiveness: Canonical VC Dimension
The Vapnik–Chervonenkis (VC) dimension is a way to gauge the expressive power of a hypothesis class. It is based on the notion of shattering we just defined, and it plays a key role in helping us determine which hypothesis classes are PAC learnable and which aren’t.
Let’s begin by attempting to intuitively define the canonical VC dimension:
Given a hypothesis class H, its VC dimension, denoted VCdim(H), is defined to be the greatest natural number n for which there exists a sample of size n that is shattered by H.
Using Sₙ(H) enables us to express this much more cleanly and succinctly:
VCdim(H) = max{ n ∈ ℕ : Sₙ(H) = 2ⁿ }
However, this definition isn’t quite precise. Note that the set of numbers n for which the shattering coefficient equals 2ⁿ may be infinite; the class of all binary classifiers over ℝ, for example, shatters samples of every size. (Consequently, it is possible that VCdim(H) = ∞.) In that case, the set has no well-defined maximum. We address this by taking the supremum instead:
VCdim(H) = sup{ n ∈ ℕ : Sₙ(H) = 2ⁿ }
This rigorous and concise definition is the one we’ll use moving forward.
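As an example, the definition lets us pin down the VC dimension of the threshold class from our running example in a few lines of illustrative code: since only samples of size 1 are shattered, the supremum is simply 1. This sketch rechecks Sₙ(H) = 2ⁿ up to an arbitrary cutoff, duplicating the brute-force helper from the earlier sketch so it stays self-contained.

```python
def s_n_thresholds(n):
    """S_n(H) for 1-D thresholds x >= k, brute-forced as in the earlier sketch."""
    sample, cuts = range(1, n + 1), [0.5 + i for i in range(n + 1)]
    return len({tuple(1 if x >= k else -1 for x in sample) for k in cuts})

# The set { n : S_n(H) = 2^n } is just {1} for thresholds, so its supremum,
# VCdim(H), equals 1. (The cutoff of 10 is arbitrary; for thresholds the set
# cannot contain anything larger anyway.)
print(max(n for n in range(1, 11) if s_n_thresholds(n) == 2 ** n))  # 1
```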
Adding Preferences to the Mix: Strategic Shattering Coefficients
Generalizing the canonical notions we just went over so that they work in a strategic setup is fairly straightforward. Redefining shattering coefficients in terms of the data point best response we defined in the previous article is practically all we’ll need to do.
Given a hypothesis class H, a preference set R, and a cost function c, the nᵗʰ shattering coefficient of Sᴛʀᴀᴄ⟨H, R, c⟩, denoted σₙ(H, R, c), represents the greatest number of labelings achievable by classifiers in H on a set of n potentially-manipulated feature vectors, i.e., n data point best responses.
As a reminder, the data point best response Δ(x, r; h) is the feature vector that a data point with original features x and preference r chooses when faced with a classifier h: whichever z maximizes its utility, i.e., the realized reward 𝕀(h(z) = 1) ⋅ r minus the cost of manipulating x into z.
We can tweak the notation we used in our discussion of canonical shattering coefficients to formalize this further.
The main difference is that each x in the sample has to have a corresponding r ∈ R. Other than that, putting the data point best response where we had x in the canonical case works smoothly.
As a quick sanity check, let’s consider what happens if R = { 0 }. The realized reward term 𝕀(h(z) = 1) ⋅ r will be 0 for all the data points. Maximizing utility thus becomes synonymous with minimizing cost. The best way to minimize the cost incurred by a data point is trivial: never manipulate its feature vector.
Δ(x, r; h) thus always ends up being just x, placing us firmly within the territory of canonical classification. It follows that σₙ(H, { 0 }, c) = Sₙ(H) for all H, c. This is consistent with our observation that the impartial preference class represented by R = { 0 } is equivalent to canonical binary classification.
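Here is a small, self-contained sketch of what a data point best response might look like for the one-dimensional threshold classifiers from before, assuming the cost of moving from x to z is simply the distance |z − x|; both that cost and the function signature are illustrative choices, not the paper’s general setup. It also confirms the sanity check above: with r = 0, the best response is always the original x.

```python
def best_response(x, r, threshold):
    """Best response of a 1-D data point to the classifier 'x >= threshold'.

    Illustrative setup: reward r for being classified positively, and a cost
    equal to the distance moved. The data point either stays at x or moves
    just far enough (to the threshold) to be classified positively.
    """
    stay_utility = r if x >= threshold else 0     # no manipulation, no cost
    move_utility = r - max(threshold - x, 0)      # move to the cheapest positively-classified point
    return threshold if move_utility > stay_utility else x

# With r = 0 there is never an incentive to move, so the best response is x itself:
print(best_response(x=1.0, r=0.0, threshold=1.5))  # 1.0
# With a large enough reward, the data point manipulates its feature vector:
print(best_response(x=1.0, r=2.0, threshold=1.5))  # 1.5
```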
Expressiveness with Preferences: Strategic VC Dimension (SVC)
Having defined the nᵗʰ strategic shattering coefficient, we can simply swap out the Sₙ(H) in the canonical definition of the VC dimension for σₙ(H, R, c).
SVC(H, R, c) = sup{ n ∈ ℕ : σₙ(H, R, c) = 2ⁿ }
Based on the example we considered above, we find that SVC(H, { 0 }, c) = VCdim(H) for any H, c. Indeed, SVC is to VCdim as the strategic shattering coefficient is to its canonical counterpart: both are elegant generalizations of non-strategic concepts.
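To see the strategic machinery change something (and yet not everything), here is an illustrative sketch with a single positive reward, R = { 0.75 }, and the same distance-based cost as before. Manipulation shifts which thresholds produce which labelings, but the induced labels stay monotone in x, so (1, -1) remains unachievable; two points still can’t be strategically shattered, and SVC stays at 1 in this particular setup.

```python
def strategic_labelings(sample, r, thresholds):
    """Labelings induced on `sample` by classifiers x >= k after each data point
    best-responds, given a shared reward r and a cost equal to the distance moved."""
    labelings = set()
    for k in thresholds:
        # a point below the threshold moves to it only if the reward beats the cost
        moved = [k if (x < k and r > k - x) else x for x in sample]
        labelings.add(tuple(1 if z >= k else -1 for z in moved))
    return labelings

sample, r = [1.0, 2.0], 0.75
# candidate thresholds around every value where some data point's behaviour can change
cuts = [0.5, 1.5, 1.9, 2.5, 2.9, 3.5]
print(strategic_labelings(sample, r, cuts))
# (1, -1) is still missing, so sigma_2 = 3 < 2^2 and SVC = 1 here, matching VCdim
```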
From SVC to Strategic PAC Learnability: The Fundamental Theorem of Strategic Learning
We will now use SVC to state the Fundamental Theorem of Strategic Learning, which relates the complexity of a strategic classification problem to its (agnostic) PAC learnability.
A strategic classification instance Sᴛʀᴀᴄ⟨H, R, c⟩ is agnostic PAC learnable if and only if SVC(H, R, c) is finite. The sample complexity for strategic agnostic PAC learning is m(δ, ε) ≤ Cε⁻² ⋅ (SVC(H, R, c) + log(1/δ)), with C being a constant.
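In practice, the bound translates directly into a back-of-the-envelope sample size estimate, up to the unspecified constant C. A minimal sketch, with C = 1 as a purely illustrative placeholder:

```python
import math

def strategic_sample_complexity(svc, epsilon, delta, C=1.0):
    """Evaluate the bound m(delta, epsilon) <= C * epsilon^-2 * (SVC + log(1/delta)).

    C stands in for the theorem's unspecified universal constant.
    """
    return math.ceil(C * epsilon ** -2 * (svc + math.log(1 / delta)))

# e.g., SVC = 3, epsilon = 0.05, delta = 0.01 -> on the order of a few thousand
# samples (times whatever the true constant C turns out to be)
print(strategic_sample_complexity(svc=3, epsilon=0.05, delta=0.01))  # 3043
```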
We won’t elaborate too much on how this can be proven. Suffice it to say that it boils down to a clever reduction to the (well-documented) Fundamental Theorem of Statistical Learning, which is essentially the non-strategic version of the theorem. If you’re mathematically inclined and interested in the nuts and bolts of the proof, you can find it in Appendix B of the paper.
This theorem essentially completes our generalization of classic PAC learning to a strategic classification setting. It shows that the way we defined SVC doesn’t just make sense in our heads; it actually works as a generalization of VCdim where it matters most. Armed with the Fundamental Theorem, we’re well-equipped to analyze strategic classification problems much as we would any old binary classification problem. In my opinion, being able to determine whether a strategic problem is theoretically learnable or not is pretty incredible!