Abstract – A very large amount of data (on the order of exabytes or zettabytes) is called Big Data. Quantifying such a large amount of data and storing it electronically is not easy. The Hadoop system is used to process these large datasets, and MapReduce programs are used to gather the data relevant to a request. To achieve good performance, big data processing requires proper scheduling. Scheduling techniques assign jobs to available resources so as to minimize starvation and maximize resource utilization. Performance can be increased further by enforcing deadline constraints on jobs. The goal of this research is to study and analyze various scheduling algorithms for better performance.
Index Terms – Big Data, MapReduce, Hadoop, Job Scheduling Algorithms.

I. INTRODUCTION

Currently, the term big data [1] has become very popular in the information technology sector. Big data refers to a broad range of datasets that are hard to manage with conventional applications. Big data is applied in finance and business, banking, online and onsite purchasing, healthcare, astronomy, oceanography, engineering, and many other fields. These datasets are very complex and are growing exponentially day by day. As data increases in volume, variety and velocity, processing it becomes more complex. Correlating, linking, matching and transforming such big data is a complex process.
Big data, being a developing field, has many research problems and challenges to address. The major research problems in big data are the following:

1) Handling data volume [1], [2]: The large amount of data coming from different fields of science such as biology, astronomy and meteorology makes processing very difficult for scientists.
2) Analysis of big data: It is difficult to analyze big data because of the heterogeneity and incompleteness of the data. Collected data can differ in format, variety and structure [3].
3) Privacy of data in the context of big data [3]: There is public fear regarding the inappropriate use of personal data, particularly through the linking of data from multiple sources. Managing privacy is both a technical and a sociological problem.
4) Storage of huge amounts of data [1], [3]: This is the problem of how to recognize and efficiently store important information extracted from unstructured data.
5) Data visualization [1]: Data processing techniques should be efficient enough to enable real-time visualization.
6) Job scheduling in big data [4]: This problem focuses on the efficient scheduling of jobs in a distributed environment.
7) Fault tolerance [5]: This is another issue in the Hadoop framework. In Hadoop, the NameNode is a single point of failure. Block replication is one of the fault tolerance techniques used by Hadoop. Fault tolerance techniques must be efficient enough to handle failures in a distributed environment.
MapReduce [6] provides an ideal framework for processing such large datasets using parallel and distributed programming approaches.

II. MAPREDUCE

MapReduce operations depend on two functions, Map and Reduce. Both functions are written by the user according to need. The Map function takes an input pair and generates a set of intermediate key/value pairs. The MapReduce library collects all the intermediate values associated with the same intermediate key and passes them to the Reduce function for further processing. The Reduce function receives an intermediate key together with the set of values for that key, and merges those values to form a possibly smaller set of values. Figure 1 shows the overall MapReduce process.

Fig. 1 The Overall MapReduce Word Count Process.
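The word count flow described in this section can be illustrated with a minimal single-process sketch. It is not Hadoop code: the function names `map_fn`, `shuffle`, `reduce_fn` and `word_count`, and the two sample documents, are assumptions chosen for illustration, and the grouping step stands in for the work the MapReduce library does between the two phases.

```python
from collections import defaultdict

def map_fn(_doc_id, text):
    """Map: emit an intermediate (word, 1) pair for every word in the input."""
    for word in text.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group all intermediate values that share the same intermediate key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    """Reduce: merge the values of one key into a smaller set (here, one sum)."""
    return key, sum(values)

def word_count(documents):
    """Run map, shuffle and reduce in sequence over a dict of documents."""
    intermediate = [
        pair
        for doc_id, text in documents.items()
        for pair in map_fn(doc_id, text)
    ]
    return dict(reduce_fn(k, v) for k, v in shuffle(intermediate).items())

# Hypothetical input: document id -> text.
counts = word_count({
    "d1": "big data needs big clusters",
    "d2": "data locality matters",
})
```

In a real cluster the map and reduce calls would run in parallel on different nodes; the sketch only shows how the data moves between the two user-written functions.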
III. HADOOP ARCHITECTURE

Scheduling decisions are taken by the master node, called the Job Tracker, while the worker nodes, called Task Trackers, execute the tasks.

Fig. 2 Hadoop Architecture [11]

A Hadoop cluster includes a single master node and multiple slave nodes. Figure 2 shows the Hadoop architecture. The single master node consists of a Job tracker, Task tracker, Name node and Data node.

A. Job tracker

The primary function of the job tracker is managing the task trackers and tracking resource availability. The Job tracker is the node that controls the job execution process. It assigns MapReduce tasks to specific nodes in the cluster. Clients submit jobs to the Job tracker. When the work is completed, the Job tracker updates its status. Client applications can ask the Job tracker for information.

B. Task tracker

The task tracker follows the orders of the job tracker and periodically updates the job tracker with its status. Task trackers run tasks and send reports to the Job tracker, which keeps a complete record of each job. Every Task tracker is configured with a set of slots, which indicates the number of tasks that it can accept.

C. Name node

The name node keeps two tables: one mapping blocks to their locations, and one recording which blocks are stored on which data node. Whenever a data node suffers disk corruption of a particular block, the first table is updated; whenever a data node is detected to be dead, because of a network or node failure, both tables are updated. The tables are updated only on the basis of node failures. The update does not depend on any neighboring blocks or block locations to identify its destination. Each block is kept separate, along with its job nodes and its allocated process.

D. Data node

The node that stores the data in the Hadoop system is known as a data node. All data nodes send a heartbeat message to the name node every three seconds to signal that they are alive. If the name node does not receive a heartbeat from a particular data node for ten minutes, it considers that data node to be dead or out of service.
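The heartbeat rule above (a message every three seconds, dead after ten minutes of silence) can be sketched as a small bookkeeping class. This is an illustrative model, not the HDFS implementation: the class `NameNodeMonitor`, its methods, and the node names are hypothetical, and time is passed in explicitly as seconds so that the example is deterministic.

```python
HEARTBEAT_INTERVAL_S = 3      # data nodes report every three seconds
DEAD_TIMEOUT_S = 10 * 60      # ten minutes of silence means "dead"

class NameNodeMonitor:
    """Tracks the last heartbeat time of each data node (hypothetical helper)."""

    def __init__(self):
        self.last_seen = {}

    def heartbeat(self, node_id, now):
        """Record a heartbeat received from node_id at time `now` (seconds)."""
        self.last_seen[node_id] = now

    def dead_nodes(self, now):
        """Return nodes whose last heartbeat is at least DEAD_TIMEOUT_S old."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen >= DEAD_TIMEOUT_S
        )

monitor = NameNodeMonitor()
monitor.heartbeat("dn1", now=0)
monitor.heartbeat("dn2", now=0)

# dn1 keeps sending heartbeats every three seconds; dn2 goes silent.
for t in range(HEARTBEAT_INTERVAL_S, DEAD_TIMEOUT_S + 1, HEARTBEAT_INTERVAL_S):
    monitor.heartbeat("dn1", now=t)

dead = monitor.dead_nodes(now=DEAD_TIMEOUT_S)
```

After ten simulated minutes only the silent node is reported, which is the point at which the name node would treat its blocks as unavailable.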
The name node then initiates another data node for the process. The data nodes also update the name node with their block information periodically.

IV. JOB SCHEDULING IN BIG DATA

The default scheduling algorithm is based on FIFO, where jobs are executed in the order of their submission. Later, the ability to set the priority of a job was added. Facebook and Yahoo contributed significant work in developing schedulers, i.e. the Fair Scheduler [8] and the Capacity Scheduler [9] respectively, which were subsequently released to the Hadoop community. This section describes various job scheduling algorithms in big data.

A. Default FIFO Scheduling

The default Hadoop scheduler operates using a FIFO queue. After a job is divided into independent tasks, the tasks are loaded into the queue and assigned to free slots as they become available on Task Tracker nodes.
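A minimal sketch of this queueing behaviour follows. It assumes a single global queue and ignores priorities and data locality; the function `fifo_schedule`, the job names and the slot identifiers are hypothetical and only serve to show how a later job waits behind an earlier one.

```python
from collections import deque

def fifo_schedule(jobs, free_slots):
    """Assign queued tasks to free slots strictly in job submission order.

    jobs: list of (job_name, number_of_tasks) in submission order.
    free_slots: list of slot identifiers that are currently free.
    Returns (assignments, tasks_still_waiting).
    """
    # Split every job into independent tasks and enqueue them in order.
    queue = deque(f"{name}/task{i}" for name, n in jobs for i in range(n))
    assignments = []
    for slot in free_slots:
        if not queue:
            break
        assignments.append((slot, queue.popleft()))
    return assignments, list(queue)

# jobA was submitted first and has three tasks; jobB has one task.
assigned, waiting = fifo_schedule([("jobA", 3), ("jobB", 1)], ["slot0", "slot1"])
```

With two free slots, both go to `jobA`, and the single task of `jobB` stays in the queue until all of `jobA`'s tasks have been placed.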
Although there is support for assigning priorities to jobs, this is not turned on by default. Typically each job would use the whole cluster, so jobs had to wait for their turn. Even though a shared cluster offers great potential for offering large resources to many users, the problem of sharing resources fairly between users requires a better scheduler. Production jobs need to complete in a reasonable time.

B. Fair Scheduling

The Fair Scheduler [8] was developed at Facebook to manage access to their Hadoop cluster and was subsequently released to the Hadoop community. The Fair Scheduler aims to give each user a fair share of the cluster capacity over time. Users may assign jobs to pools, with each pool allocated a guaranteed minimum number of Map and Reduce slots. Free slots in idle pools may be allocated to other pools, while excess capacity within a pool is shared among its jobs. The Fair Scheduler supports preemption: if a pool has not received its fair share for a certain period of time, the scheduler will kill tasks in pools running over capacity in order to give the slots to the pool running under capacity. In addition, administrators may enforce priority settings on certain pools. Tasks are therefore scheduled in an interleaved fashion, based on their priority within their pool and on the cluster capacity and usage of their pool. As jobs have their tasks assigned to Task Tracker slots for computation, the scheduler tracks the deficit between the amount of compute time actually used and the ideal fair allocation for that job. Over time, this has the effect of ensuring that jobs receive roughly equal amounts of resources. Shorter jobs are assigned enough resources to finish quickly. At the same time, longer jobs are guaranteed not to be starved of resources.
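The deficit idea can be illustrated with a small simulation. It is a simplified sketch under stated assumptions, not the Fair Scheduler's code: there is one pool, every active job is entitled to an equal share, each task takes one unit of slot time, and a freed slot is assumed to go to the job with the largest deficit. The function `fair_assign` and the job names are hypothetical.

```python
def fair_assign(jobs, slot_grants):
    """Hand out slot-time units one at a time to the job with the largest deficit.

    jobs: dict job_name -> number of tasks (one slot-time unit per task).
    slot_grants: how many slot-time units become available in total.
    Returns the order in which jobs received slots.
    """
    used = {name: 0.0 for name in jobs}    # slot time actually received
    ideal = {name: 0.0 for name in jobs}   # slot time under a perfectly fair split
    remaining = dict(jobs)
    order = []
    for _ in range(slot_grants):
        active = [name for name in remaining if remaining[name] > 0]
        if not active:
            break
        # Each active job is ideally entitled to an equal part of this unit.
        for name in active:
            ideal[name] += 1.0 / len(active)
        # Assumption: the job furthest behind its ideal share gets the slot.
        chosen = max(active, key=lambda name: ideal[name] - used[name])
        used[chosen] += 1.0
        remaining[chosen] -= 1
        order.append(chosen)
    return order

order = fair_assign({"long": 6, "short": 2}, slot_grants=8)
```

In this run the two jobs alternate until the short job has finished, after which the long job receives every remaining slot. Under FIFO the short job would instead have waited behind all six tasks of the long job.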
C. Capacity Scheduling

The Capacity Scheduler [10], initially developed at Yahoo, addresses a usage scenario where the number of users is large and there is a need to ensure a fair allocation of computation resources among users. The Capacity Scheduler allocates jobs, based on the submitting user, to queues with configurable numbers of Map and Reduce slots. Queues that contain jobs are given their configured capacity, while free capacity in a queue is shared among other queues. Within a queue, scheduling operates on a modified priority queue basis with specific user limits, with priorities adjusted based on the time a job was submitted and on the priority setting allocated to that user and class of job. When a Task Tracker slot becomes free, the queue with the lowest load is chosen, from which the oldest remaining job is chosen. A task is then scheduled from that job. This has the effect of enforcing cluster capacity sharing among users, rather than among jobs as in the Fair Scheduler.

D. Dynamic Proportional Scheduling

As claimed by Sandholm and Lai [12], Dynamic Proportional scheduling provides more job sharing and prioritization, which results in an increased share of cluster resources and more differentiation in the service levels of various jobs. This algorithm improves response time for multi-user Hadoop environments.

E. Resource-Aware Adaptive Scheduling (RAS)

RAS was proposed by Polo et al. [13] for MapReduce with multi-job workloads, to increase resource utilization among machines while monitoring the completion time of processes.
Zhao et al. [14] provide a task scheduling algorithm based on resource attribute selection (RAS). Before a task is scheduled, it sends a group of test tasks to an execution node to work out the node's resources, and then chooses the optimal node to execute the task according to the task's resource needs and the fit between the resource node and the task, using historical task information if available.

F. MapReduce task scheduling with deadline constraints (MTSD) algorithm

According to Tang et al. [15], the scheduling algorithm sets two deadlines: the map-deadline and the reduce-deadline. The reduce-deadline is simply the user's job deadline. Pop et al. [16] present a classical approach to periodic task scheduling by considering a scheduling system with different queues for periodic and aperiodic tasks, with the deadline as the main constraint. They develop a method to estimate the number of resources required to schedule a set of aperiodic tasks, taking into account both execution and data transfer costs. Based on a mathematical model, and using different simulation scenarios, the following statements were proved: (1) various sources of independent aperiodic tasks can be approximated as a single one; (2) when the number of estimated resources exceeds a data center's capacity, migrating tasks between different regional centers is the appropriate solution with respect to the global deadline; and (3) in a heterogeneous data center, a higher number of resources is needed for the same request with respect to the deadline constraints.
Wang and Li [17] described task scheduling in MapReduce for distributed data centers on heterogeneous networks through adaptive heartbeats, job deadlines and data locality. Job deadlines are divided according to the maximum data volume of tasks. With these constraints in mind, task scheduling is formulated as an assignment problem in each heartbeat, in which adaptive heartbeats are calculated from the processing times of tasks, jobs are sequenced in terms of the divided deadlines, and tasks are scheduled by the Hungarian algorithm. On the basis of data transfer and processing times, the most appropriate data center for all mapped jobs is determined in the reduce phase.

G. Delay Scheduling

The objective is to address the conflict between locality and fairness.
When a node requests a task, if the head-of-line job cannot launch a local task, the scheduler skips that job and looks at later jobs. If a job has been skipped for long enough, it is allowed to launch non-local tasks, to avoid starvation. Delay scheduling temporarily relaxes fairness to obtain higher locality by allowing jobs to wait for scheduling on a node with local data.

Song et al. [18] offer a game theory based technique to solve scheduling problems by dividing a Hadoop scheduling problem into two levels: job level and task level. For job level scheduling, they use a bid model to guarantee fairness and reduce the average waiting time. For the task level, they transform the scheduling problem into an assignment problem and use the Hungarian method to optimize it.
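To make the assignment formulation concrete, the sketch below finds the cheapest one-to-one mapping of tasks to nodes for a small cost matrix. Note that it uses exhaustive search over permutations rather than the Hungarian method: both return a minimum-cost assignment, but exhaustive search is only practical for a handful of tasks. The function `best_assignment` and the cost values are hypothetical.

```python
from itertools import permutations

def best_assignment(cost):
    """Return (total_cost, assignment) minimising the sum of cost[task][node].

    cost is a square matrix; assignment[i] is the node given to task i.
    Exhaustive search is used here for clarity; the Hungarian method
    solves the same problem in polynomial time.
    """
    n = len(cost)
    best_total, best_perm = None, None
    for perm in permutations(range(n)):
        total = sum(cost[task][node] for task, node in enumerate(perm))
        if best_total is None or total < best_total:
            best_total, best_perm = total, perm
    return best_total, list(best_perm)

# Hypothetical costs: rows are tasks, columns are nodes
# (for example, estimated run time including data transfer).
cost = [
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
]
total, assignment = best_assignment(cost)
```

For this matrix the cheapest mapping sends task 0 to node 1, task 1 to node 0 and task 2 to node 2, even though task 1 alone would prefer node 1; the assignment view optimizes the total rather than each task in isolation.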
Wan et al. [19] provide a multi-job scheduling algorithm for MapReduce based on game theory that deals with the competition for resources among many jobs.

H. Multi Objective Scheduling

Nita et al. [20] describe a scheduling algorithm named MOMTH that considers objective functions related to resources and users at the same time, with constraints such as deadline and budget. The execution model assumes that all MapReduce jobs are independent, that no node fails before or during the scheduling computation, and that the scheduling decision is taken based only on current knowledge.
Bian et al. present a scheduling strategy based on fault tolerance. According to this strategy, the cluster detects the speed of the current nodes and backs up the intermediate MapReduce results to a high performance cache server, because the data created by those nodes may soon go wrong. The cluster can therefore quickly resume execution at the previous level if several nodes fail: the reduce nodes read the Map output from the cache server, or from both the cache and the node, and performance stays high.

I. Hybrid Multistage Heuristic Scheduling (HMHS)

Chen et al. [21] describe a heuristic scheduling algorithm named HMHS that attempts to solve the scheduling problem by splitting it into two sub-problems: sequencing and dispatching. For sequencing, they use a heuristic based on Pri (the modified Johnson's algorithm). For dispatching, they propose two heuristics, Min-Min and Dynamic Min-Min.

V.
TABLE I: COMPARISON OF VARIOUS JOB SCHEDULING ALGORITHMS IN BIG DATA

| Scheduling Algorithm | Technology | Advantages | Disadvantages |
|---|---|---|---|
| Default FIFO Scheduling [22] | Schedules jobs based on their priorities in first-in first-out order. | 1. Cost of the entire cluster scheduling process is low. 2. Simple to implement and efficient. | 1. Designed only for a single type of job. 2. Low performance when running multiple types of jobs. 3. Poor response times for short jobs compared to large jobs. |
| Fair Scheduling [8] | Distributes compute resources equally among the users/jobs in the system. | 1. Less complex. 2. Works well with both small and large clusters. 3. Can provide fast response times for small jobs mixed with larger jobs. | 1. Does not consider the job weight of each node. |
| Capacity Scheduling [10] | Maximizes resource utilization and throughput in a multi-tenant cluster environment. | 1. Ensures guaranteed access, with the potential to reuse unused capacity and prioritize jobs within queues over a large cluster. | 1. The most complex among the three schedulers. |
| Dynamic Proportional Scheduling [12] | Designed for data intensive workloads; tries to maintain data locality during job execution. | 1. It is a fast and flexible scheduler. 2. It improves response time for multi-user Hadoop environments. | If the system eventually crashes, all unfinished low priority processes are lost. |
| Resource-Aware Adaptive Scheduling (RAS) [13] | Dynamic free slot advertisement; free slot priorities/filtering. | Improves job performance. | Only takes action on appropriate slow tasks. |
| MapReduce task scheduling with deadline constraints (MTSD) [15] | Achieves nearly full overlap via the novel idea of including reduce in the overlap. | 1. Reduces computation time. 2. Improves performance for the important class of shuffle-heavy MapReduce jobs. | Works better with small clusters only. |
| Delay Scheduling [18] | Addresses the conflict between locality and fairness. | 1. Simplicity of scheduling. | None in particular. |
| Multi Objective Scheduling [20] | The execution model assumes that all MapReduce jobs are independent, that there is no node failure before or during the scheduling computation, and that the scheduling decision is taken based only on present knowledge. | Keeps performance high. | Execution time is too large. |
| Hybrid Multistage Heuristic Scheduling (HMHS) [21] | Uses Johnson's algorithm together with the Min-Min and Dynamic Min-Min algorithms. | Achieves not only a high data locality rate but also high cluster utilization. | Does not ensure reliability. |

VI. DISCUSSIONS

This paper provides a classification of Hadoop schedulers based on different parameters such as time, priority and resources.
It discusses how various task scheduling algorithms help achieve better results in a Hadoop cluster. Furthermore, this paper discusses the advantages and disadvantages of the various task scheduling algorithms. The comparison shows that each scheduling algorithm has its own advantages and disadvantages, so all of the algorithms are important in job scheduling.

VII. CONCLUSIONS

This paper gives an overall picture of the different job scheduling algorithms in big data and compares most of the properties of the various task scheduling algorithms. Individual scheduling techniques used to improve data locality, efficiency, makespan, fairness and performance are elaborated and discussed. However, scheduling remains an open area for research.