We propose a layered statistical model for image segmentation and labeling that combines independently extracted, possibly overlapping sets of figure-ground (FG) segmentations. The construction of consistent image segmentations, called tilings, is cast as optimization over sets of maximal cliques sampled from a graph connecting all non-overlapping figure-ground segment hypotheses. Potential functions over cliques combine unary, Gestalt-based figure qualities with pairwise compatibilities among spatially neighboring segments, constrained by T-junctions and the boundary interface statistics of real scenes. Building on the segmentation layer, we further derive a joint image segmentation and labeling model (JSL) which, given a bag of FGs, constructs a joint probability distribution over both the compatible image interpretations (tilings) composed from those segments and their labeling into categories. Drawing a sample from the joint distribution can be interpreted as first sampling a tiling, then sampling a labeling conditioned on that tiling. We learn the segmentation and labeling parameters jointly by maximum likelihood, using a novel estimation procedure we refer to as incremental saddle-point approximation. The partition function over tilings and labelings is approximated with increasing accuracy by including incorrect configurations that candidate models rate as probable during learning. State-of-the-art results are reported on the Berkeley, Stanford and Pascal VOC datasets, where an improvement of 28% was achieved for the segmentation task only (tiling), and an accuracy of 47.8% was obtained on the test set of VOC12 for semantic labeling (JSL).
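To make the notion of a tiling concrete, the following is a minimal toy sketch, not the procedure used in this work. It assumes segments are given as sets of pixel indices, enumerates maximal cliques exhaustively with the basic Bron-Kerbosch recursion (whereas the model samples them), and scores each clique with made-up unary and pairwise values. The names `compatibility_graph`, `maximal_cliques`, `tiling_distribution`, and all numbers are hypothetical.

```python
import math
from itertools import combinations


def compatibility_graph(segments):
    """Connect every pair of segments whose pixel sets do not overlap."""
    adj = {i: set() for i in range(len(segments))}
    for i, j in combinations(range(len(segments)), 2):
        if segments[i].isdisjoint(segments[j]):
            adj[i].add(j)
            adj[j].add(i)
    return adj


def maximal_cliques(adj):
    """Enumerate all maximal cliques (basic Bron-Kerbosch, no pivoting)."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates, x: already processed vertices
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def tiling_distribution(cliques, unary, pairwise):
    """Gibbs distribution over tilings from summed unary + pairwise scores."""
    scores = []
    for c in cliques:
        s = sum(unary[i] for i in c)
        s += sum(pairwise.get((i, j), 0.0)
                 for i, j in combinations(sorted(c), 2))
        scores.append(s)
    z = sum(math.exp(s) for s in scores)  # partition function (toy, exact)
    return [math.exp(s) / z for s in scores]


# Toy "image" with pixels 0..5 and five overlapping FG hypotheses (assumed).
segments = [{0, 1, 2}, {3, 4, 5}, {0, 1}, {2, 3}, {4, 5}]
unary = [0.9, 0.8, 0.4, 0.3, 0.5]          # hypothetical figure qualities
pairwise = {(0, 1): 0.6, (2, 3): 0.2}      # hypothetical compatibilities

adj = compatibility_graph(segments)
cliques = maximal_cliques(adj)
probs = tiling_distribution(cliques, unary, pairwise)

for c, p in zip(cliques, probs):
    print(sorted(c), round(p, 3))
```

In this toy setting the partition function is computed exactly because there are only a handful of cliques; on real images the number of tilings makes exhaustive enumeration impractical, which is why sampling and approximation are needed.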