3D Interpretation of Edges: Part 1
1 Introduction
1.1 Monocular depth
Human vision takes advantage of several sources of information when determining the distance to objects in the physical surroundings. The most powerful source is stereopsis, without which many of the activities humans engage in would be very hard to manage. Motor measurements also provide important information. For example, ocular convergence is a very good way of accurately determining the distance to a particular point of interest. Another way of telling distance is to measure the relative motion between objects in the surroundings and the eyes, a phenomenon often referred to as motion parallax. Owls, for example, exploit this by moving their head horizontally back and forth.
These sources of information all provide precise, absolute, quantitative measurements of depth to points in the physical surroundings, most easily thought of as real numbers. Pictorial cues are another type of depth information. A commonly mentioned one is linear perspective [see figure 1.1(a)], which was first extensively investigated during the Renaissance by, for example, Leonardo da Vinci. Other simple cues are object size, where larger objects are judged to be closer; vertical position, where objects higher up in the visual field are most often farther away from the perceiver; texture gradient [figure 1.1(b)], which reveals surface orientation; and occlusion [figure 1.1(c)], the phenomenon investigated in this thesis, which tells that one object is closer than another because the first partially occludes the second.
Figure 1.1 The 3-D effects of (a) linear perspective, (b) texture gradient and (c) occlusion.
Pictorial cues can all be found in a photograph, which in itself is only a 2D object but can still portray a 3D scene. Unfortunately, these cues individually give only partial depth information, either relative, as in surface normals, or qualitative, as in occlusion. This presents a major challenge to automatic processing of these cues, since information most often has to be integrated from several sources to build a useful representation. Humans, however, are very skilled at experiencing depth from monocular cues alone, for example when looking at photographs.
[...] since stereo techniques fail to provide a 3D interpretation of these areas.
Another reason could be that different internal representations are helpful for surviving in a three-dimensional physical world. These representations might be more easily constructed from different depth cues. Suppose a person wants to get to a chair behind a table. That the chair is behind the table can easily be determined from occlusion cues, which give a first idea of how the problem of getting to the chair must be approached, namely by first getting around the table. Navigating the body, however, might need more precise measurements so that body parts do not bump into the table.
1.2 T-junctions as a cue to occlusion
The most powerful feature in an image that signals occlusion is the T-junction. T-junctions arise when one object lies in front of another in the visual field, where the edges of the occluded object meet the occluding object. Consider the objects in figure 1.2(a). Two types of configurations give rise to one or several T-junctions: either the objects lie beside each other in the visual field, as in figure 1.2(b), or one of them occludes the other, as in figure 1.2(c). Given that all configurations in 3D space are equally probable, most configurations that give rise to T-junctions do involve occlusion.
A major obstacle is that surfaces can contain T-junctions in their texture. The ideal input to an algorithm that infers depth from occlusion should thus contain only occluding edges. This problem has been avoided in this work by creating stimuli with only occluding edges and no texture. Methods to suppress the non-occluding edges that normally arise in edge detection have been developed by, for example, Månsson (2000) and should be considered a preprocessing stage before occlusion detection.
Figure 1.3 Illusory triangle [from Kanizsa (1979)]. The line ends signal occlusion, and together they form the perception of a white triangle in front of three circular discs and another triangle.
Another problem that arises during edge detection is that not all object discontinuities create gradients in the projected image. The illusory triangle in figure 1.3, for example, has no edge gradients, as it has the same white color as the background. For that reason T-junctions often show up as line ends. This complicates the problem of occlusion detection, since the missing contour fragments of the circles in figure 1.3 could be due to color differences in these areas and not to occlusion. Edge detection techniques like the Canny edge detector (Canny, 1986) also often fail to accurately identify edge segments in the most interesting regions, the junctions. This often results in slightly misplaced T-junctions and line ends.
1.3 Grouping and completion
Much work has been done within the field of Gestalt psychology, by for example Kanizsa (1979), to identify organizing principles in human perception. Among them are principles of grouping, such as proximity, similarity, closure and continuity (Smith, 1993). These principles bind together features detected on the retina, forming more meaningful units. For example, the circles that lie closer together in figure 1.4(a) are perceived as belonging together, as are the more similar objects in figure 1.4(b).
Figure 1.4 The Gestalt principles of (a) proximity and (b) similarity.
Completion of missing edges, or fragments of edges, is a thoroughly investigated phenomenon which can partly be considered grouping. Edge fragments are connected with inferred curves, forming units like surfaces and objects. Edge fragments can be missing for several reasons: missing detectors, lack of gradient, and occlusion. An important distinction is between modal and amodal completion, where modal completion is the process involved in the first two cases and amodal completion in the last one.
Figure 1.5 (a) Two penguins [from Nieder (2002)]. (b) Output from the Ikaros version of the Canny edge detector.
Modal completion refers to all contours that are perceived but not objectively present in the stimulus, and can form perceptions of illusory contours as in figure 1.3. Here the illusory contours are indicated by line ends. Other times there are gaps in the edges. In figure 1.5 fragments of the penguins’ boundaries are missing, but humans still perceive two complete penguins.
Amodal completion refers to the perceptual completion of partially occluded objects. For example, in figure 1.6(a) a cube is perceived. One of the most basic principles of amodal completion is that straight lines that continue behind other objects tend to be seen as one continuous line and not two separate fragments [see figure 1.6(b)]. Another fundamental principle in amodal completion is regularity. For example people tend to see a circle behind a square in figure 1.6(c).
Figure 1.6 Some principles of amodal completion. (a) Objects are amodally completed. (b) Straight lines are continued. (c) Objects are completed by regularity (one can see a circle behind the square).
1.4 Neural and cognitive completion
Much is known about low level vision and high level vision, and many computer algorithms have been developed that can perform the processing tasks believed to be performed at these levels. Little is known, however, about mid level vision, where perceptual organization is done. Somehow low level features captured on the retina are organized to form perceptions of surfaces and objects. The number of features considered is vast, for example color, illumination, binocular disparity, surface orientation and occlusion. All of these can contribute to the perception of objects and the geometrical relations between them.
Most contemporary work in the area of edge completion is based on neural processing, with concepts like stochastic completion fields (Williams and Jacobs, 1997; Sharon et al., 1997). This is a strictly bottom-up approach to contour completion, and it resembles the massively parallel processing that occurs in early vision. It has been reported that illusory contours evoke cortical responses in area 18 of the visual cortex of macaque monkeys (Heidt et al., 1984), which is major support for this approach.
During the sixties and seventies, however, much was done to interpret edges and junctions with a more cognitively oriented approach (e.g., Waltz, 1975). This research has led to techniques that can interpret simple line drawings of three-dimensional polyhedral objects (Martí et al., 1994). Although this early work was not concerned with completion, it presents a different way of processing edges, where features are extracted locally and reasoned about logically. The problem of interpreting features is formulated as manipulating the organization of features on a 2½-D sketch until globally congruent interpretations are formed (Marr, 1982). This approach has been taken further by Saund (1999) to interpret 3-D structures in real scenes, including inference of illusory contours.
The different representations and processing of stimuli in these two approaches lead to different advantages. The first approach involves no search, whereas the second does. This makes neural processing faster and more reliable. Line drawing interpretation, however, has the advantage of leaving the door open to abstract integration of many different image features, as well as complex a priori knowledge and top-down processing.
The question addressed by the experiments presented here could be stated as: what constitutes the intermediate step between the detection of illusory contours in the visual cortex and the phenomenological experience of continuous contours and object scenery? It is obvious that, for example, top-down processing must start somewhere in the chain between retina and object recognition, but the question is where and how. Neural models can do preprocessing work (e.g., Månsson, 2000), but what happens next? One possibility is that feature detectors locally enhance (or postulate) the illusory contours early in the visual cortex, producing responses resembling stochastic completion fields, but that the integration of these into continuous contours, objects and depth relations is done in a more cognitive fashion.
Figure 1.7 (a) Kanizsa-Bergman Display [from Kelly and Grossberg (2000)] (b) The Sambin (1974) cross with ambiguous interpretation.
There seem to be some known facts about perceptual organization that bottom-up processing cannot account for. One phenomenon that requires top-down processing is amodal completion, where object recognition plays a great role. In figure 1.7(a) one can easily see the masked B’s as B’s. The appearance of a B is learned, and image enhancement techniques alone would not suffice to infer the occluded contours.
Further, interpretations are often ambiguous, as in figure 1.7(b). The illusory object in the figure can be seen as either a square or a circle. This implies that hypothesis generation is involved and that illusory contours are interpreted in relation to meaningful entities.
Also, some of the Gestalt principles are difficult to implement with image refinement techniques. It has been attempted to implement the principle of regularity by measures of symmetry (Zabrodsky et al., 1993), which requires symbolic manipulation. Although one wouldn’t expect a mathematical toolbox hard-coded in brain structure, it might indicate something about the representations and processes involved.
2 Extraction of T-junctions and line ends
Junction detection and classification is important to many vision tasks, for example in recognition and grouping. Hence, much work has been done to develop efficient and robust detectors. In this project a T-junction detector was needed, where the angles of the wedges are accurately identified as well as the center of the junction. The algorithm described here was modeled after the detectors developed by Parida et al. (1997) and Sluzek (2001).
First of all, locating and extracting the parameters of junctions is a computationally expensive task. For that reason a variant of the Harris corner detector was implemented, and only those points with an output above a threshold are processed further. The wedges of a junction can be found by calculating a discrete intensity profile inside a circle centered at the junction, for each angle θ ∈ [0, 2π). Distortions often occur at the location of junction centers, which motivates the removal of a small circular disk around the center (Parida et al., 1997).
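For reference, the textbook form of the Harris corner response is given below. The particular variant used here is not specified, so the window weights w, the constant k and the threshold τ are assumptions of this sketch; C is used for the response to avoid a clash with the filter radius R.

```latex
M = \sum_{(u,v) \in W} w(u,v)
\begin{pmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{pmatrix},
\qquad
C = \det M - k\,(\operatorname{tr} M)^2,
\qquad
\text{candidate if } C > \tau .
```

Here I_x and I_y are the image derivatives evaluated over the window W around the point in question.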
To accumulate the intensity profile, filters were produced for 128 directions. At first straight lines were considered, where each pixel gets a value equal to the length of the line segment that traverses it multiplied by 1/r. This presents a problem, however, since a junction center need not be located exactly at a pixel center. Therefore the center pixel was divided into 10 × 10 areas and straight lines were drawn from each of these areas. Further, if the number of directions is too small a similar problem arises, since wedges that fall between filters will not be detected. Therefore the final filters were produced by also integrating discretely over the angle interval that the particular filter covers. Finally, the filters are scaled by the sum of all pixel values. The radius of the filters, R, was set to 11 and the radius of the inner disk, R0, to 3.
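The idea of an angular intensity profile can be sketched as follows. This is a deliberately simplified stand-in, not the filter bank described above: each edge pixel in the annulus R0 < r ≤ R is simply added to the nearest of 128 direction bins with weight 1/r, where r is assumed to be the distance to the candidate center. The sub-pixel 10 × 10 integration and the final normalization are omitted, and the function name `intensity_profile` is hypothetical.

```python
import math

def intensity_profile(edges, cx, cy, n_dirs=128, R=11.0, R0=3.0):
    """Accumulate edge strength per direction inside the annulus R0 < r <= R.

    edges is a list of rows (edges[y][x]); (cx, cy) is the candidate centre.
    Each edge pixel adds value / r to the bin nearest to its angle.
    Simplification: nearest-bin voting instead of precomputed wedge filters.
    """
    profile = [0.0] * n_dirs
    h, w = len(edges), len(edges[0])
    for y in range(max(0, int(cy - R)), min(h, int(cy + R) + 1)):
        for x in range(max(0, int(cx - R)), min(w, int(cx + R) + 1)):
            value = edges[y][x]
            if value == 0:
                continue
            dx, dy = x - cx, y - cy
            r = math.hypot(dx, dy)
            if r <= R0 or r > R:
                continue  # skip the inner disk and everything outside the circle
            theta = math.atan2(dy, dx) % (2 * math.pi)
            k = int(round(theta / (2 * math.pi) * n_dirs)) % n_dirs
            profile[k] += value / r
    return profile
```

For a perfect T-junction drawn with one-pixel lines, such a profile would show three isolated peaks, two of them half a turn apart.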
Figure 2.1 (a) Test image. (b) Harris corner response (superimposed on the original image). (c) A typical intensity profile for a T-junction.
These filters are convolved with the edge images, in all directions, at the points marked as possible junctions by the Harris corner detector [see figure 2.1(b)]. The result is a profile as illustrated in figure 2.1(c). Local maxima of the profile are then detected and thresholded. If the intensity profile reveals three local maxima and two of these lie in opposite directions, the point is classified as a possible T-junction and assigned the sum of the maxima as a measure of goodness. In a final step, local maxima of this measure in areas of 7 × 7 pixels are identified and classified as definite T-junctions.
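The classification rule can be sketched on a circular profile as below. The helper names, the `tolerance` (how many bins two peaks may deviate from being exactly opposite) and the choice of returning the remaining peak as the stem are assumptions made for illustration; the 7 × 7 non-maximum step is not included.

```python
def circular_local_maxima(profile, threshold):
    """Indices above threshold that are local maxima on the circular profile."""
    n = len(profile)
    peaks = []
    for i in range(n):
        v = profile[i]
        # profile[i - 1] wraps around for i == 0; '>=' keeps one peak per plateau
        if v > threshold and v > profile[i - 1] and v >= profile[(i + 1) % n]:
            peaks.append(i)
    return peaks

def classify_t_junction(profile, threshold, tolerance=4):
    """Return (is_candidate, goodness, stem_index).

    A candidate has exactly three peaks, two of which are half a turn apart
    (within `tolerance` bins). Goodness is the sum of the three peak values;
    the remaining peak is reported as the stem direction.
    """
    n = len(profile)
    peaks = circular_local_maxima(profile, threshold)
    if len(peaks) != 3:
        return False, 0.0, None
    for a in range(3):
        for b in range(a + 1, 3):
            gap = abs(peaks[a] - peaks[b])
            gap = min(gap, n - gap)
            if abs(gap - n // 2) <= tolerance:
                stem = peaks[3 - a - b]  # index of the peak that is not a or b
                return True, sum(profile[p] for p in peaks), stem
    return False, 0.0, None
```

A Y-junction, whose three wedges are roughly a third of a turn apart, also has three peaks but fails the opposite-direction test.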
Since simple hand-created edge representations were used during subsequent experiments, a more efficient way to locate possible T-junctions than the Harris corner detector was also implemented. These hand-created images all contained perfect lines of width 1 with perfect junction areas, so the presence of more than two edge pixels in the 8-neighborhood of a particular pixel was used instead as an indicator of a possible T-junction.
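A minimal sketch of this neighbor-count test is shown below, assuming a binary edge image stored as a list of rows; the function name is hypothetical and the code is not the FastTDetector module itself.

```python
def t_junction_candidates(edges):
    """Return (x, y) of edge pixels with more than two edge pixels among their 8 neighbours."""
    h, w = len(edges), len(edges[0])
    found = []
    for y in range(h):
        for x in range(w):
            if not edges[y][x]:
                continue
            count = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dx == 0 and dy == 0:
                        continue
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w and edges[ny][nx]:
                        count += 1
            if count > 2:
                found.append((x, y))
    return found
```

Note that pixels directly adjacent to the junction can also pass this test, so the result is a small cluster of candidates rather than a single point; a later selection step is still needed to pick one of them.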
Finally, another feature evaluated in the algorithms described later is the line end. A line end can be seen as a junction with one wedge, and a detector was developed in basically the same way as the T-junction detector. In this case the most critical point of interest is at the location of the junction center, and therefore R0 was set to zero. A faster line end detector was also implemented, since line ends in the hand-created stimuli have only one 8-connected pixel in their neighborhood. This reduces the problem to finding the maximum of the intensity profile.
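The fast line end test for hand-created stimuli can be sketched in the same style. Returning the offset from the single neighbor to the end pixel as a rough direction is an addition made here for illustration and is not taken from the text; the function name is hypothetical.

```python
def line_ends(edges):
    """Return (x, y, dx, dy) for edge pixels with exactly one 8-connected edge neighbour.

    (dx, dy) is the offset from that neighbour to the end pixel, i.e. the
    direction in which the line would continue (an illustrative extra).
    """
    h, w = len(edges), len(edges[0])
    ends = []
    for y in range(h):
        for x in range(w):
            if not edges[y][x]:
                continue
            neighbours = [
                (x + dx, y + dy)
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
                if (dx or dy)
                and 0 <= y + dy < h
                and 0 <= x + dx < w
                and edges[y + dy][x + dx]
            ]
            if len(neighbours) == 1:
                nx, ny = neighbours[0]
                ends.append((x, y, x - nx, y - ny))
    return ends
```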
3 Monocular depth from diffusion
When analysing the problem of obtaining depth from occlusion, it seems to be about finding the occluding contours and then somehow sorting the contours by their relative depth. T-junctions provide a means of inferring such relationships. Algorithms designed to detect occluding edges give very noisy output, though, which presents a major obstacle that is not dealt with here. Instead, images of occluding edges were drawn by hand, presenting perfect edges and perfect junctions. An example of such a stimulus is illustrated in figure 3.1(a).
The configuration in figure 3.1(a) is often perceived by humans as a rectangle behind a square. In this case it is quite easy to infer depth relations by assigning the edges that form the stems of the T-junctions a greater depth than those forming the caps. In a 3-D world, however, one obstacle is that an occluding contour can have different depth relations at different points along its length, as seems to be the case in figure 3.1(b). Somehow the objects that make up the edges in this figure must be slanted or bent relative to the perceiver.
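The simple case, where stems are pushed behind caps, can be sketched as a ranking over pairwise constraints. The edge labels and the function `depth_ranks` are hypothetical, and this is only an illustration of the sorting idea, not one of the algorithms evaluated here. Contradictory constraints, as would presumably arise for a configuration like figure 3.1(b), show up as a cycle.

```python
def depth_ranks(constraints):
    """Rank edges by depth from (cap_edge, stem_edge) pairs; the stem lies behind the cap.

    Returns {edge: rank} with rank 0 nearest to the viewer. Every stem ends up
    at least one level deeper than each cap that occludes it.
    Raises ValueError if the constraints are cyclic.
    """
    constraints = list(constraints)
    edges = set()
    for cap, stem in constraints:
        edges.update((cap, stem))
    rank = {e: 0 for e in edges}
    # An acyclic set of constraints settles within len(edges) passes.
    for _ in range(len(edges) + 1):
        changed = False
        for cap, stem in constraints:
            if rank[stem] < rank[cap] + 1:
                rank[stem] = rank[cap] + 1
                changed = True
        if not changed:
            return rank
    raise ValueError("cyclic occlusion constraints")
```

For example, `depth_ranks([("square", "rectangle")])` places the rectangle one level behind the square.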
The problem presented is clearly under-constrained, and additional constraints must be imposed. Gestalt principles like simplicity and good continuation are often suggested for completion algorithms, which share the same lack of constraints, although in one dimension. One way is to consider missing edge fragments as elastica (Sharon et al., 1997). In the present case it is reasonable to believe that a surface parallel to the image plane is simpler than a slanted one, and that an unbent surface is simpler than a bent one.
Here a diffusion model was considered. Diffusion has been used for many purposes in image processing, for example color and orientation segmentation, motion extraction and symmetry detection (Proesmans and Gool, 1999). It has also been used by Kogo et al. (2002) to reconstruct subjective surfaces. The numerical integration of a diffusion equation involves massively parallel computation, which resembles the way image processing is done in early human vision, and can thus be considered a model of actual brain processing. However, as a neural net it has extremely low connectivity, since pixels in the image are only locally connected by diffusion algorithms.
Figure 3.1 Examples of stimuli. (a) A square behind a rectangle. (b) Edges almost impossible to interpret as all parallel to the image plane.
Different diffusion equations are used to solve different problems, and often non-homogeneous equations are proposed. In this case simple Gaussian diffusion seems to fit the above stated constraints, and without a source field it leads to the differential equation:
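For reference, the standard homogeneous, isotropic diffusion (heat) equation for a depth field f(x, y, t), which is presumably what equation 3.1 denotes, has the form below. The symbol f follows the notation used for the depth pumps, and absorbing the diffusion constant into the time scale is an assumption of this sketch.

```latex
\frac{\partial f}{\partial t} = \nabla^2 f
  = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2}
```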
Finally, some kind of initial conditions or source fields must be added that enforce that the surfaces that form the stem of the T-junction are pushed farther away than the surface that forms the cap. This has similarities with the technique used by Kelly and Grossberg (2000), who in their model introduced gaps in the stem of T-junctions, allowing depth to flow out from surface areas close to these.
3.1 Implementation
The numerical integration of differential equation 3.1 can be done by approximating the time and space derivatives:
with
and
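A common explicit scheme of this kind is given here only as an assumed stand-in for equations 3.2 to 3.4: a forward difference in time and central second differences in space on a unit pixel grid. The index notation and the time step Δt are assumptions of this sketch.

```latex
f^{t+1}_{i,j} = f^{t}_{i,j}
  + \Delta t \left( \delta_{xx} f^{t}_{i,j} + \delta_{yy} f^{t}_{i,j} \right)

\delta_{xx} f_{i,j} = f_{i+1,j} - 2 f_{i,j} + f_{i-1,j}

\delta_{yy} f_{i,j} = f_{i,j+1} - 2 f_{i,j} + f_{i,j-1}
```

With unit grid spacing and unit diffusion constant, this explicit scheme is stable for Δt ≤ 1/4.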
Further, border conditions are needed, and the most natural choice is one that allows no flow of depth across edges, that is, ∂f/∂n = 0, where n is the boundary normal.
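One way to realize such a no-flow condition numerically is to exchange depth only between pairs of neighboring non-edge pixels, so that edge pixels and the image border act as walls. The sketch below assumes an explicit 4-neighbor scheme with a hypothetical time step `dt`; it illustrates the idea and is not the DiffusedDepth module.

```python
def diffuse_step(f, is_edge, dt=0.2):
    """One explicit diffusion step with zero flux across edge pixels and the image border.

    f and is_edge are lists of rows of equal size. Depth is exchanged only
    between 4-connected pairs of non-edge pixels, so total depth is conserved.
    dt should not exceed 0.25 for this scheme to stay stable.
    """
    h, w = len(f), len(f[0])
    out = [row[:] for row in f]
    for y in range(h):
        for x in range(w):
            if is_edge[y][x]:
                continue  # edge pixels carry no depth and pass none on
            total = 0.0
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and not is_edge[ny][nx]:
                    total += f[ny][nx] - f[y][x]
            out[y][x] = f[y][x] + dt * total
    return out
```

Because depth spreads by one pixel per step in this scheme, a region bounded by edges levels out only after many iterations, while depth never leaks into a neighboring region.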
To impose a source field, a kind of artificial depth pump is introduced. The T-junctions in the contour image are localized and two small areas on each side of the cap are chosen as sinks. Depth is moved from these sinks to an area about 3 pixels away from the junction in the continuation of the stem, until the difference in depth between each individual sink and the source is 10. The amount of depth pumped in each iteration is 0.1 · (10 − (f_source − f_sink)).
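The pump rule can be sketched as below. For brevity each sink and the source is a single pixel rather than a small area, and the sinks are processed one after another; both choices are assumptions of this sketch, as is the function name `pump`.

```python
def pump(f, sinks, source, target=10.0, rate=0.1):
    """Move depth from each sink pixel to the source pixel, in place.

    sinks is a list of (x, y) and source is one (x, y). The amount moved per
    sink is rate * (target - (f_source - f_sink)), so pumping fades out as
    the depth difference approaches `target`.
    """
    sx, sy = source
    for kx, ky in sinks:
        amount = rate * (target - (f[sy][sx] - f[ky][kx]))
        f[ky][kx] -= amount
        f[sy][sx] += amount
    return f
```

With a single sink and no diffusion, the depth difference d follows d ← 0.8 · d + 2 per call and therefore converges to 10. In a full run the pump would alternate with a diffusion step, which keeps draining the difference and so keeps the pump active.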
Figure 3.2 (a) Stimuli. (b) Resulting depth map.
Figure 3.3 A modified version of the Kanizsa stratification display (Kanizsa, 1985).
3.2 Results
First, the stimulus in figure 3.2(a) was tested. After all the T-junctions were identified, 500 iterations were run, with the resulting depth map presented in figure 3.2(b). The processing time for this image of size 128 × 128 on a Pentium III 500 MHz was 3.2 seconds. The resulting depth map contains surfaces that are bent only around the points where the pumps are located.
Next a version of the Kanizsa stratification display was tested (see figure 3.3). This stimulus is far more complicated, and a negative aspect of the resulting depth map in figure 3.3(b) is that the vertical arm is bent in three places: in the middle, halfway up and halfway down. A more intuitive interpretation would be that it is bent only in the middle and leans in the same direction after that.
One task that an algorithm that assigns depth from occluding edges should do well is simple sorting of the kind in figure 3.4(a). However, the result could be better [see figure 3.4(b)]. Results not presented here reveal that the depth map does not get much better even after 5000 iterations. Basically, a connection is missing between the upper and lower parts of the middle surface in the model. Such a connection was introduced in a following experiment by enforcing that the derivative is constant in the occluded parts.
Figure 3.4 Two levels of occlusion. (a) Stimuli. (b) After 500 iterations. (c) After 20000 iterations with added conditions (see text).
Figure 3.5 (a) Contours with small deficiencies. (b) After 500 iterations.
This was implemented by connecting several points on each visible side of the occluded surfaces. In a physical world this would equate to building pipes that connect these parts. The result after 20000 iterations is presented in figure 3.4(c). In this run the pumps were turned off after 500 iterations.
Also of interest is how well missing fragments in the edge representation are handled, which seems to present no problem for the stimuli in figure 3.5. A more difficult task is to fill illusory contours as in the Kanizsa triangle presented in figure 3.6. The surface of the triangle is somewhat pronounced.
The numerical integration of diffusion equation 3.1, as formulated in equations 3.2 through 3.4, has the limitation that depth can only diffuse one pixel per iteration. This necessitates many iterations. As illustrated in figure 3.7, there seem to be only local effects after 100 iterations, but after 300 iterations the depth relations have spread along the surfaces, and not much changes after that.
Figure 3.6 (a) A modified version of the Kanizsa triangle. (b) Results after 500 iterations.
Figure 3.7 Evolution through time. (a-e) Results after 100-500 iterations.
3.3 Discussion
The depth map displayed in figure 3.2(b) first of all seems to be quite an intuitive one, where the non-occluded parts are only slightly bent close to the artificial pumps. Also, those surfaces that tend to be amodally perceived seem to be continuous across the occluded areas. The result for the more complicated stimulus in figure 3.3 is also quite good, although the distortions in the T-junction areas are more pronounced. One reason these results are satisfying is probably that these object configurations are quite symmetric. The distortions around the artificial pumps are therefore disguised in the global solution. This does not happen in figure 3.4. The less satisfying performance on the simple stimulus in figure 3.4(a) indicates that a diffusion model has a slightly different purpose than simply sorting surfaces.
Since the solutions seem to stabilize after 300-500 iterations, the problem is not that the algorithm is too slow. Instead, other improvements might make the process more successful. I think some good restrictions should be introduced to ensure that the two amodally connected parts of an occluded surface are nicely connected. This was done in figure 3.4(c) and was also tried for the other stimuli. However, these restrictions interacted too strongly with the artificial pumps.
The result presented in figure 3.5 also shows that small noise in the edge representation does not introduce a problem. Figure 3.6(a), however, illustrates more of a challenge. The depth map in figure 3.6(b) does elevate the areas where the occlusion occurs, and some form of a triangle does get a little more pronounced. However, as expected from diffusion, the illusory contours are not well marked by depth discontinuities. The result is similar to those presented by Kogo et al. (2002). Countless more successful attempts have been made to infer subjective (illusory) contours. However, I think diffusion has one theoretical advantage. Perception of illusory contours is often believed to be closely related to, or even dependent on, depth perception. This algorithm has a clear connection between perception of depth and subjective figures. Methods that fill in subjective contours directly do not seem to have such a clear connection.
Ikaros Modules
The functions described above were implemented using the following Ikaros modules, which will become available in the contributions section.
Module | Function |
---|---|
HarrisCornerResponse | Computes the Harris corner response. |
TDetector | Extracts T-junction positions and orientations. |
FastTDetector | A faster variant for hand created stimuli. |
LineEndDetector | Extracts line ends. |
DiffusedDepth | The diffusion algorithm. |
TConnect | Performs some amodal completion from T-junctions. |
OcclusionCompleter | Performs amodal completion and assigns qualitative depth. |
LineEndConnector | Performs completion and sorts in depth. |
LineEndToContour | Infers illusory contours from line ends. |
EdgeComplete | Finds one solution and applies LineEndToContour. |
MonocularDepth | The combined algorithm. |
FocusCompletion | Sorts T-junctions from focal point and performs completion. |
AttentionCompletion | Finds solutions within selected areas of attention. |
References
Canny, J. (1986). A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 8:679–698.
Gregory, R. L. (1972). Cognitive contours. Nature, 238:51–52.
Heidt, R. V. D., Peterhans, E., and Baumgartner, G. (1984). Illusory contours and cortical neuron responses. Science, 224(4654):1260–1262.
Jarmasz, J. P. (2002). Integrating perceptual organization and attention: A new model for object-based attention. In Proceedings of the Twenty-Fourth Annual Conference of the Cognitive Science Society, pages 494–499.
Kanizsa, G. (1979). Organization in vision: essays on gestalt perception. Praeger, New York.
Kanizsa, G. (1985). Seeing and thinking. Revista di Psicologia, 49:7–30.
Kelly, F. and Grossberg, S. (2000). Neural dynamics of 3-d surface perception: figure-ground separation and lightness perception. Perception & Psychophysics, 62(1):1596–1618.
Kogo, N., Strecha, C., Fransen, R., Caenen, G., Wagemans, J., and Gool, L. J. V. (2002). Reconstruction of subjective surfaces from occlusion cues. In Proceedings of Biologically Motivated Computer Vision 2002: Second International Workshop, BMCV 2002, pages 311–321, Tübingen, Germany.
Marr, D. (1982). Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. W. H. Freeman and Company, San Francisco.
Martí, E., Regincos, J., and Villanueva, J. J. (1994). Line drawing interpretation as polyhedral objects to man-machine interaction in cad systems. Advances in Pattern Recognition and Image Analysis, pages 158–169.
Månsson, J. (2000). A computational model of suppressive mechanisms in human contour perception. Lund University Cognitive Studies, 81.
Nieder, A. (2002). Seeing more than meets the eye: processing of illusory contours in animals. Journal of Comparative Physiology, 188:249–260.
Parida, L., Geiger, D., and Hummel, R. (1997). Junctions: detection, classification and reconstruction. In IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 687–698.
Proesmans, M. and Gool, L. V. (1999). Grouping based on coupled diffusion maps. In Forsyth, D. A., Mundy, J. L., Gesù, V. D., and Cipolla, R., editors, Shape, Contour and Grouping in Computer Vision, volume 1681 of Lecture Notes in Computer Science, pages 196–216, Heidelberg. Springer.
Rensink, R. A. and Enns, J. T. (1998). Early completion of occluded objects. Vision Research, 38:2489–2505.
Sambin, M. (1974). Angular margins without gradients. Italian Journal of Psychology, 1:355–361.
Saund, E. (1999). Perceptual organization of occluding contours of opaque surfaces. Computer Vision and Image Understanding, 76(1):70–82.
Sharon, E., Brandt, A., and Basri, R. (1997). Completion energies and scale. In IEEE Transactions on Pattern Analysis and Machine Intelligence, San Juan, Puerto Rico.
Sluzek, A. (2001). A local algorithm for real-time junction detection in contour images. In Skarbek, W., editor, Proceedings of Computer Analysis of Images and Patterns, 9th International Conference (CAIP 2001), volume 2124 of Lecture Notes in Computer Science, Warsaw, Poland. Springer.
Smith, R. E. (1993). Psychology. West Publishing Company, St. Paul, US.
Waltz, D. (1975). Understanding line drawings of scenes with shadows. In Winston, P. H., editor, The Psychology of Computer Vision. McGraw-Hill, New York.
Westelius, C.-J., Knutsson, H., and Granlund, G. (1995). Low Level Focus of Attention Mechanisms. In Crowley, J. L. and Christensen, H. I., editors, Vision as Process: Basic Research on Computer Vision Systems. Springer.
Williams, L. R. and Hanson, A. R. (1994). Perceptual completion of occluded surfaces. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 104–112. IEEE Comput. Soc. Press.
Williams, L. R. and Jacobs, D. W. (1997). Stochastic completion fields: A neural model of illusory contour shape and salience. Neural Computation, 9(4):837–858.
Zabrodsky, H., Peleg, S., and Avnir, D. (1993). Completion of occluded shapes using symmetry. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR93), pages 678–679, New York.
This text is an excerpt from:
Karlsson, S. (2004). Monocular depth from occluding edges. Lund: Centre for Mathematical Sciences. M Sc Thesis