
\section{Related Work}
There exist a handful of traditional approaches to crowd counting from single images, which mostly rely on extracting low-level visual features. These features are then mapped to a count or a density map using various regression techniques. In contrast, recent approaches mostly employ CNNs in their pipelines to extract features and/or predict the final count. Loy et al. \cite{loy} categorized the existing methods into three groups: (1) detection-based methods, (2) regression-based methods, and (3) density estimation-based methods. In the following, we briefly review some of these methods; the interested reader may refer to \cite{loy} for a more exhaustive review.

\subsection{Detection-based Methods}
Detection-based methods essentially follow sliding-window detection algorithms to count the number of object instances in an image \cite{detection}. The method of \cite{detection}, for instance, is based on the Dirichlet Process Mixture Model (DPMM), a non-parametric process. Moreover, several works have been proposed to count pedestrians by detection \cite{29,14,27}. Although these methods have proved to be strong detectors, they are negatively affected by high-density crowds and background clutter.

\subsection{Regression-based Methods}
To overcome the aforementioned issues of detection-based methods, another group of methods counts by regression, where the learning procedure maps features extracted from local image patches to their counts. Idrees et al. \cite{idrees} proposed a method that leverages multiple sources of information to estimate the number of people in a dense crowd image. The use of global consistency and multi-source information makes their method robust, to some extent, against multi-scale perspective and occlusion issues.

\subsection{Density Estimation-based Methods}
Over time, owing to the effectiveness of CNNs in various computer vision applications, several CNN-based approaches have been developed for crowd counting. Walach et al. \cite{walachwcnn} proposed one of the well-known methods in this category, using CNNs trained layer by layer. The novelty of their work lies in two additions, namely layered boosting and selective sampling, which also make the method robust to outliers. In contrast to such layer-wise estimation methods, Shang et al. \cite{shang} proposed an end-to-end estimation method for crowd counting. Their method uses CNNs to simultaneously learn local counts (from patches of the input image) and a global count (from the full-size input image); the two counts are finally fused to predict the final count, which remains scale variant.

Another challenging task in crowd counting is dealing with an unseen target crowd \textit{scene}. Most existing methods fail to cope with this issue, and their accuracy drops significantly when applied to an unseen scene. In an attempt to overcome this problem, Zhang et al. \cite{zhangcross} proposed a data-driven method that fine-tunes the trained layers for unseen target scenes. While their method tackles the unseen-scene issue, it remains vulnerable to another problem, namely \textit{scale variation}.

To address the issue of scale variation in crowded images, Zhang et al. \cite{mcnn} proposed a multi-column architecture for crowd counting (MCNN). This architecture uses three branches of CNNs, each operating at a different scale of the input, so that the model becomes more scale invariant. They down-sample the input image to $1/4$ of its original size and train the model using an L2 loss. Finally, the outputs of all three branches are concatenated to produce the density map of the input, and the overall count is obtained by simply summing this density map. A notable property of this design is that it accepts images of arbitrary size as input. In addition, the authors introduced a large-scale annotated dataset, called the ShanghaiTech dataset \cite{mcnn}.
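To make the multi-column, count-by-summation idea concrete, the following is a minimal PyTorch-style sketch. It is our own illustrative simplification: the kernel sizes and channel counts are assumptions and do not reproduce the exact MCNN configuration. Each branch uses a different kernel size to cover a different scale, the branch outputs are concatenated and fused into a single-channel density map, and the count is the sum over that map.

\begin{verbatim}
# Minimal multi-column counting sketch (illustrative; kernel sizes and
# channel counts are our own assumptions, not the exact MCNN design).
import torch
import torch.nn as nn

class MultiColumnCounter(nn.Module):
    def __init__(self):
        super().__init__()
        # Three branches with different receptive fields, one per scale.
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(3, 8, k, padding=k // 2), nn.ReLU())
            for k in (9, 5, 3)
        ])
        # A 1x1 convolution fuses the concatenated branch outputs
        # into a single-channel density map.
        self.fuse = nn.Conv2d(3 * 8, 1, 1)

    def forward(self, x):
        features = torch.cat([b(x) for b in self.branches], dim=1)
        return self.fuse(features)  # predicted density map

model = MultiColumnCounter()
density = model(torch.randn(1, 3, 256, 256))
count = density.sum()  # the count is the integral of the density map
\end{verbatim}

Because every layer in the sketch is convolutional, it accepts inputs of arbitrary size, mirroring the property of MCNN noted above.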
O\~noro-Rubio and L\'opez-Sastre \cite{onoro} addressed the scale-variation issue by proposing a scale-aware counting model called Hydra CNN. This model is formulated as a regression in which the network learns to map image patches to their corresponding object density maps. Boominathan et al. \cite{crowdnet} also tackled scale variation, using a combination of shallow and deep networks along with extensive data augmentation performed by sampling patches from multi-scale representations of the input image. Although these methods have proved robust to scale variations, they suffer from a drawback that limits the size of the input image during training: with reduced training image sizes, they cannot learn features at the original image scale. To address this drawback, Sindagi et al. \cite{cascaded} presented an end-to-end cascaded CNN that learns high-level features and performs density estimation jointly, producing both the estimated count and a high-quality density map; the high-level features enable the network to learn globally relevant and discriminative representations. In contrast to the MCNN approach \cite{mcnn}, their method specifically aims at reducing over- and underestimation of the count by systematically leveraging context information, in the form of different crowd density levels, through different networks, and additionally incorporates local context and an adversarial loss to improve the quality of the density maps. Most recently, Sam et al. \cite{scnn} proposed a Switching-CNN that chooses the most suitable regressor among several independent regressors for each input patch: a switch network makes a hard (i.e., one-hot) decision to pick a single regressor per image patch (a simplified sketch of this mechanism is given below). None of these networks, however, provides a coarse-to-fine output. Inspired by these ideas, we develop a model that is both coarse-to-fine and performs reasonably well in comparison with existing models.
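The hard switching mechanism can be sketched as follows. This is a toy illustration under our own assumptions: the actual Switching-CNN trains a separate CNN classifier as the switch and uses full regressor columns, whereas the modules below are placeholders.

\begin{verbatim}
# Toy sketch of hard, one-hot regressor switching (placeholder
# modules; not the architecture of Sam et al.).
import torch
import torch.nn as nn

class HardSwitch(nn.Module):
    def __init__(self, regressors, switch):
        super().__init__()
        self.regressors = nn.ModuleList(regressors)
        self.switch = switch  # scores each regressor for a given patch

    def forward(self, patch):
        scores = self.switch(patch)           # shape: (1, num_regressors)
        choice = torch.argmax(scores).item()  # hard (one-hot) decision
        # Only the selected regressor processes this patch.
        return self.regressors[choice](patch)

# Three dummy density-map regressors and a dummy switch classifier.
regressors = [nn.Conv2d(3, 1, 3, padding=1) for _ in range(3)]
switch = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                       nn.Linear(3, 3))
model = HardSwitch(regressors, switch)
density = model(torch.randn(1, 3, 64, 64))  # density map for one patch
\end{verbatim}

Note that the argmax makes the selection non-differentiable, which is why the switch classifier is trained with a separate classification objective rather than end to end through the chosen regressor.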