Event cameras asynchronously output the polarity values of pixel-level log intensity alterations. They are robust against motion blur and can be adopted in challenging light conditions. Owing to these advantages, event cameras have been employed in various vision tasks such as depth estimation, visual odometry, and object detection. In particular, event cameras are effective in stereo depth estimation to ﬁnd correspondence points between two cameras under challenging illumination conditions and/or fast motion. However, because event cameras provide spatially sparse event stream data, it is difﬁ- cult to obtain a dense disparity map. Although it is possible to estimate disparity from event data at the edge of a structure where intensity changes are likely to occur, estimating the disparity in a region where event occurs rarely is challenging. In this study, we propose a deep network that combines the features of an image with the features of an event to generate a dense disparity map. The proposed network uses images to obtain spatially dense features that are lacking in events. In addition, we propose a spatial multi-scale correlation between two fused feature maps for an accurate disparity map. To validate our method, we conducted experiments using synthetic and real-world datasets.