Abstract – A huge amount of data (on the scale of exabytes or zettabytes) is called Big Data. Quantifying such a large amount of data and storing it electronically is not easy. The Hadoop system is used to handle these large data sets, and MapReduce programs are used to collect big data according to the request. To achieve greater performance, big data requires proper scheduling. Scheduling techniques assign jobs to available resources so as to minimize starvation and maximize resource utilization.
Performance can be increased by implementing deadline constraints on jobs. The goal of this research is to study and analyze various scheduling algorithms for better performance.

Index Terms – Big Data, MapReduce, Hadoop, Job Scheduling Algorithms.

I. INTRODUCTION

Currently, the term big data [1] has become very trendy in the Information Technology segment. Big data refers to a broad range of datasets that are hard to manage with conventional applications.
Big data can be applied in finance and business, banking, online and onsite purchasing, healthcare, astronomy, oceanography, engineering, and many other fields. These datasets are very difficult to handle and are growing exponentially day by day. As data increases in volume, variety and velocity, processing it becomes more complex. Correlating, linking, matching and transforming such big data is a complex process. Big data, being a developing field, has many research problems and challenges to address. The major research problems in big data are the following: 1) handling data volume, 2) analysis of big data, 3) privacy of data, 4) storage of huge amounts of data, 5) data visualization, 6) job scheduling in big data, 7) fault tolerance.

1) Handling data volume [1], [2]: The large amount of data coming from different fields of science, such as biology, astronomy and meteorology, makes processing very difficult for scientists.
2) Analysis of big data: Big data is difficult to analyze because of the heterogeneity and incompleteness of the data. Collected data can differ in format, variety and structure [3].
3) Privacy of data in the context of big data [3]: There is public fear regarding the inappropriate use of personal data, particularly through the linking of data from multiple sources. Managing privacy is both a technical and a sociological problem.
4) Storage of huge amounts of data [1], [3]: This is the problem of how to recognize important information extracted from unstructured data and store it efficiently.
5) Data visualization [1]: Data processing techniques should be efficient enough to enable real-time visualization.
6) Job scheduling in big data [4]: This problem focuses on the efficient scheduling of jobs in a distributed environment.
7) Fault tolerance [5]: Fault tolerance is another issue in the Hadoop framework. In Hadoop, the NameNode is a single point of failure. Block replication is one of the fault tolerance techniques used by Hadoop. Fault tolerance techniques must be efficient enough to handle failures in a distributed environment.

MapReduce [6] provides an ideal framework for processing such large datasets by using parallel and distributed programming approaches.

II. MAPREDUCE

MapReduce operations depend on two functions, Map and Reduce. Both functions are written according to the needs of the user. The Map function takes an input pair and generates a set of intermediate key/value pairs. The MapReduce library then collects all the intermediate values that are associated with the same intermediate key and passes them to the Reduce function for further processing. The Reduce function receives an intermediate key together with the set of values for that key, and merges those values to form a smaller set of values.
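The Map and Reduce roles described above can be illustrated with a word count, the example also used in Figure 1. The following is a minimal sequential sketch, not Hadoop code: the names `map_fn`, `reduce_fn` and `run_mapreduce` are hypothetical, and the grouping step that the MapReduce library performs across machines is imitated here with an in-memory dictionary.

```python
from collections import defaultdict


def map_fn(_doc_id, text):
    """Map: emit one intermediate (word, 1) pair per word of the input value."""
    return [(word.lower(), 1) for word in text.split()]


def reduce_fn(word, counts):
    """Reduce: merge all values of one intermediate key into a single total."""
    return word, sum(counts)


def run_mapreduce(inputs):
    """Sequential stand-in for the library (assumption: everything fits in memory)."""
    groups = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)  # collect values that share a key
    return dict(reduce_fn(k, v) for k, v in groups.items())


docs = [("d1", "big data needs scheduling"), ("d2", "Big Data needs Hadoop")]
word_counts = run_mapreduce(docs)
```

In a real cluster the map calls and the reduce calls run in parallel on different nodes; only the two user-written functions would stay the same.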
Figure 1 shows the overall MapReduce process.

Fig. 1 The Overall MapReduce Word Count Process.

III. HADOOP ARCHITECTURE

Scheduling decisions are taken by the master node, called the Job Tracker, while the worker nodes, called Task Trackers, execute the tasks.

Fig. 2 Hadoop Architecture [11]

A Hadoop cluster includes a single master node and multiple slave nodes. Figure 2 shows the Hadoop architecture. The single master node hosts a Job tracker, Task tracker, Name node and Data node.

A. Job tracker

The primary function of the Job tracker is to manage the Task trackers and to track resource availability. The Job tracker is the node that controls the job execution process. It assigns MapReduce tasks to particular nodes in the cluster.
Clients submit jobs to the Job tracker. When the work is completed, the Job tracker updates its status. Client applications can ask the Job tracker for information.

B. Task tracker

The Task tracker follows the orders of the Job tracker and periodically updates it with its status. Task trackers run tasks and send reports to the Job tracker, which keeps a complete record of each job. Every Task tracker is configured with a set of slots, which indicates the number of tasks that it can accept.

C. Name node

The name node maintains tables that map blocks to their locations. Whenever a data node suffers disk corruption of a particular block, the first table is updated; whenever a data node is detected to be dead because of a network failure or a node failure, both tables are updated. The tables are updated only when nodes fail. An update does not depend on any neighboring blocks or block locations to identify its destination.
Each block is kept separate, together with its job nodes and its allocated process.

D. Data node

The node that stores the data in the Hadoop system is known as a data node. All data nodes send a heartbeat message to the name node every three seconds to signal that they are alive.
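The heartbeat bookkeeping on the name node side can be sketched as follows. This is an illustrative sketch only: `HeartbeatMonitor`, its methods and the `timeout_s` parameter are hypothetical names rather than Hadoop APIs, timestamps are passed in explicitly so that the example is deterministic, and the timeout value in the demonstration is arbitrary.

```python
class HeartbeatMonitor:
    """Records the last heartbeat time of each data node (hypothetical helper)."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s  # silence longer than this marks a node as dead
        self.last_seen = {}

    def heartbeat(self, node, now_s):
        """Called whenever a heartbeat message from `node` arrives."""
        self.last_seen[node] = now_s

    def dead_nodes(self, now_s):
        """Return the nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now_s - seen > self.timeout_s
        )


monitor = HeartbeatMonitor(timeout_s=10)  # arbitrary value for the demonstration
for t in range(0, 31, 3):                 # dn1 reports every three seconds
    monitor.heartbeat("dn1", t)
monitor.heartbeat("dn2", 0)               # dn2 reports once, then goes silent
dead = monitor.dead_nodes(now_s=30)
```

Keeping only the most recent timestamp per node is enough here, because liveness depends on the time since the last heartbeat and not on the heartbeat history.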
If the name node does not receive a heartbeat from a particular data node for ten minutes, it considers that data node to be dead or out of service and assigns the process to some other data node. The data nodes periodically update the name node with block information.

IV. JOB SCHEDULING IN BIG DATA

The default scheduling algorithm is based on FIFO, where jobs are executed in the order of their submission. Later, the ability to set the priority of a job was added.
Facebook and Yahoo contributed significant work in developing schedulers, i.e. the Fair Scheduler [8] and the Capacity Scheduler [9] respectively, which were subsequently released to the Hadoop community. This section describes various job scheduling algorithms in big data.

A. Default FIFO Scheduling

The default Hadoop scheduler operates using a FIFO queue. After a job is divided into independent tasks, the tasks are loaded into the queue and assigned to free slots as these become available on Task Tracker nodes. Although there is support for assigning priorities to jobs, this is not turned on by default.
Typically each job would use the whole cluster, so jobs had to wait for their turn. Even though a shared cluster offers great potential for offering large resources to many users, the problem of sharing resources fairly between users requires a better scheduler. Production jobs need to complete in a reasonable time.
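The FIFO behaviour described above, and the reason short jobs suffer under it, can be seen in a small sketch. It is a simplification under stated assumptions: `fifo_schedule` is a hypothetical function, a job is reduced to a name and a task count, priorities are ignored (as in the default configuration), and slots are plain labels.

```python
from collections import deque


def fifo_schedule(jobs, free_slots):
    """Assign queued tasks to free slots strictly in submission order.

    jobs: list of (job_name, num_tasks) in the order the jobs were submitted.
    Returns (assignments, waiting) where assignments is a list of (slot, task).
    """
    queue = deque()
    for name, num_tasks in jobs:
        for i in range(num_tasks):
            queue.append(f"{name}-t{i}")  # every task of a job enters the queue
    assignments = []
    for slot in free_slots:
        if not queue:
            break
        assignments.append((slot, queue.popleft()))  # head of the queue goes first
    return assignments, list(queue)


assignments, waiting = fifo_schedule([("long", 3), ("short", 1)], ["s0", "s1"])
```

Both free slots go to the job that was submitted first, so the single task of the short job stays in the queue until the long job has been served.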
B. Fair Scheduling

The Fair Scheduler [8] was developed at Facebook to manage access to their Hadoop cluster and was subsequently released to the Hadoop community. The Fair Scheduler aims to give each user a fair share of the cluster capacity over time.
Users may assign jobs to pools, with each pool allocated a guaranteed minimum number of Map and Reduce slots. Free slots in idle pools may be allocated to other pools, while excess capacity within a pool is shared among jobs. The Fair Scheduler supports preemption, so if a pool has not received its fair share for a certain period of time, the scheduler will kill tasks in pools running over capacity in order to give the slots to the pool running under capacity. In addition, administrators may enforce priority settings on certain pools. Tasks are therefore scheduled in an interleaved fashion, based on their priority within their pool and on the cluster capacity and usage of their pool. As jobs have their tasks assigned to Task Tracker slots for computation, the scheduler tracks the deficit between the amount of compute time actually used and the ideal fair share for that job. Over time, this has the effect of ensuring that jobs receive roughly equal amounts of resources.
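The deficit tracking described above can be sketched in a few lines. The sketch makes assumptions that the text does not state: every free slot is given to the job with the largest deficit, all jobs share a single pool with equal weights, jobs never finish, and one slot runs for one time step. The names `pick_job` and `simulate` are hypothetical, and exact fractions are used so that ties are resolved deterministically by job name.

```python
from fractions import Fraction


def pick_job(deficits):
    """Return the job with the largest deficit (ties broken by job name)."""
    return max(sorted(deficits), key=lambda job: deficits[job])


def simulate(jobs, total_slots, steps):
    """Hand out `total_slots` slots per step; return the slot-time used per job."""
    used = {job: 0 for job in jobs}
    deficits = {job: Fraction(0) for job in jobs}
    fair_share = Fraction(total_slots, len(jobs))  # ideal slot-time per job and step
    for _ in range(steps):
        for job in jobs:
            deficits[job] += fair_share  # what the job should ideally receive
        for _ in range(total_slots):
            job = pick_job(deficits)
            used[job] += 1
            deficits[job] -= 1           # what the job actually received
    return used


used = simulate(["a", "b", "c"], total_slots=2, steps=3)
```

With three jobs competing for two slots, no job can run in every step, yet after three steps each job has received the same amount of slot-time.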
Shorter jobs are allocated sufficient resources to finish quickly. At the same time, longer jobs are guaranteed not to be starved of resources.

C. Capacity Scheduling

The Capacity Scheduler [10], originally developed at Yahoo, addresses a usage scenario where the number of users is large and there is a need to ensure a fair allocation of computation resources among users. The Capacity Scheduler allocates jobs, based on the submitting user, to queues with configurable numbers of Map and Reduce slots. Queues that contain jobs are given their configured capacity, while free capacity in a queue is shared among other queues.
Within a queue, scheduling operates on a modified priority queue basis with specific user limits, with priorities adjusted based on the time a job was submitted and on the priority setting allocated to that user and class of job. When a Task Tracker slot becomes free, the queue with the lowest load is chosen, from which the oldest remaining job is chosen. A task is then scheduled from that job.
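The selection step described above (least loaded queue first, then the oldest job in it) can be sketched as follows. The sketch rests on assumptions: load is taken to be running tasks divided by configured capacity, which the text does not define, and user limits and priority settings are left out. `pick_task` and the dictionary layout are hypothetical.

```python
def pick_task(queues):
    """Choose the (queue, job) that should receive a newly freed slot.

    queues maps a queue name to a dict with the keys
      "capacity": configured number of slots,
      "running":  number of tasks currently running,
      "jobs":     list of (submit_time, job_name) still waiting.
    Returns None when no queue has a waiting job.
    """
    candidates = {name: q for name, q in queues.items() if q["jobs"]}
    if not candidates:
        return None
    # Assumption: load = running tasks relative to configured capacity.
    least_loaded = min(
        sorted(candidates),
        key=lambda name: candidates[name]["running"] / candidates[name]["capacity"],
    )
    _, oldest_job = min(candidates[least_loaded]["jobs"])  # earliest submit time
    return least_loaded, oldest_job


queues = {
    "research": {"capacity": 4, "running": 3, "jobs": [(5, "r1")]},
    "prod": {"capacity": 8, "running": 2, "jobs": [(7, "p2"), (3, "p1")]},
}
choice = pick_task(queues)
```

Here the `prod` queue runs at a quarter of its capacity while `research` runs at three quarters, so the free slot goes to `prod` and, within it, to the job submitted earliest.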
Overall, this has the effect of enforcing cluster capacity sharing among users, rather than among jobs, as was the case in the Fair Scheduler.

D. Dynamic Proportional Scheduling

As claimed by Sandholm and Lai [12], Dynamic Proportional scheduling provides more job sharing and prioritization, which results in an increased share of cluster resources and more differentiation in the service levels of various jobs. This algorithm improves response time for multi-user Hadoop environments.

E. Resource-Aware Adaptive Scheduling (RAS)

To increase resource utilization among machines while monitoring the completion time of jobs, RAS was proposed by Polo et al. [13] for MapReduce with multi-job workloads.
Zhao et al. [14] provide a task scheduling algorithm based on resource attribute selection (RAS). It determines the resources of an execution node by sending it a group of test tasks before a task is scheduled, and then chooses the optimal node to execute the task according to the task's resource needs and the fit between the resource node and the task, using historical task information if available.

F. MapReduce task scheduling with deadline constraints (MTSD) algorithm

According to Tang et al. [15], the scheduling algorithm sets two deadlines: map-deadline and reduce-deadline. Reduce-deadline is simply the user's job deadline. Pop et al. [16] present a classical approach to periodic task scheduling by considering a scheduling system with different queues for periodic and aperiodic tasks and with the deadline as the main constraint. They develop a method to estimate the number of resources required to schedule a group of aperiodic tasks, considering both execution and data transfer costs. Based on a numerical model, and by using different simulation scenarios, MTSD proved the following statements: (1) various sources of independent aperiodic tasks can be approximated by a single one; (2) when the number of estimated resources exceeds the capacity of a data center, migrating tasks between different regional centers is the appropriate solution with respect to the global deadline; and (3) in a heterogeneous data center, a higher number of resources is needed for the same request with respect to the deadline constraints. For MapReduce, Wang and Li [17] detailed task scheduling for distributed data centers on heterogeneous networks through adaptive heartbeats, job deadlines and data locality.
Job deadlines are divided according to the maximum data volume of tasks. With this constraint in mind, task scheduling is formulated as an assignment problem in each heartbeat, in which adaptive heartbeats are calculated from the processing times of tasks, jobs are sequenced in terms of the divided deadlines, and tasks are scheduled by the Hungarian algorithm. On the basis of data transfer and processing times, the most appropriate data center for all mapped jobs is determined in the reduce phase.

G. Delay Scheduling

The objective is to address the conflict between locality and fairness. When a node requests a task, if the head-of-line job cannot launch a local task, the scheduler skips that job and looks at subsequent jobs. If a job has been skipped for long enough, it is allowed to launch non-local tasks to avoid starvation. Delay scheduling temporarily relaxes fairness to achieve higher locality by allowing jobs to wait for a scheduling opportunity on a node with local data.

Song et al. [18] offer a game theory based technique that solves scheduling problems by separating a Hadoop scheduling issue into two levels: job level and task level. For job level scheduling, a bid model is used to guarantee fairness and reduce the average waiting time.
For task level scheduling, the scheduling problem is transformed into an assignment problem and the Hungarian method is used to optimize it. Wan et al. [19] provide a multi-job scheduling algorithm for MapReduce based on game theory that deals with the competition for resources between many jobs.

H. Multi Objective Scheduling

Nita et al. [20] describe a scheduling algorithm named MOMTH that considers objective functions related to resources and users at the same time, with constraints such as deadline and budget.
The execution model assumes that all MapReduce jobs are independent. As there are no node failures before or during the scheduling computation, the scheduling decision is taken solely on the basis of present knowledge. Bian et al. present a scheduling strategy in which the cluster measures the speed of the current nodes and creates backups of the intermediate MapReduce data on a high performance cache server. The data produced by a node may be lost if that node fails soon afterwards. With the backups, the cluster can quickly resume execution at the previous level: if many nodes fail, the reduce nodes read the Map output from the cache server, or from both the cache and the node, which keeps performance high.

I. Hybrid Multistage Heuristic Scheduling (HMHS)

Chen et al. [21] describe a heuristic scheduling algorithm named HMHS that attempts to solve the scheduling problem by splitting it into two sub-problems: sequencing and dispatching.
For sequencing, they use a heuristic based on Pri (the modified Johnson's algorithm). For dispatching, they recommend two heuristics, Min-Min and Dynamic Min-Min.

V. TABLE I: COMPARISON OF VARIOUS JOB SCHEDULING ALGORITHMS IN BIG DATA

| Scheduling Algorithm | Technology | Advantages | Disadvantages |
|---|---|---|---|
| Default FIFO Scheduling [22] | Schedules jobs based on their priorities in first-in first-out order. | 1. The cost of the entire cluster scheduling process is low. 2. Simple to implement and efficient. | 1. Designed only for a single type of job. 2. Low performance when running multiple types of jobs. 3. Poor response times for short jobs compared to large jobs. |
| Fair Scheduling [8] | Distributes compute resources equally among the users/jobs in the system. | 1. Less complex. 2. Works well with both small and large clusters. 3. Can provide fast response times for small jobs mixed with larger jobs. | 1. Does not consider the job weight of each node. |
| Capacity Scheduling [10] | Maximizes resource utilization and throughput in a multi-tenant cluster environment. | 1. Ensures guaranteed access with the potential to reuse unused capacity and prioritize jobs within queues over a large cluster. | 1. The most complex among the three schedulers. |
| Dynamic Proportional Scheduling [12] | Planned for data intensive workloads; tries to maintain data locality during job execution. | 1. It is a fast and flexible scheduler. 2. It improves response time for multi-user Hadoop environments. | If the system eventually crashes, all unfinished low priority processes are lost. |
| Resource-Aware Adaptive Scheduling (RAS) [13] | Dynamic free slot advertisement; free slot priorities/filtering. | It improves job performance. | Only takes action on appropriate slow tasks. |
| MapReduce task scheduling with deadline constraints (MTSD) [15] | Achieves nearly full overlap via the novel idea of including reduce in the overlap. | 1. It reduces computation time. 2. Improves performance for the important class of shuffle-heavy MapReduce jobs. | Works better with small clusters only. |
| Delay Scheduling [18] | Addresses the conflict between locality and fairness. | 1. Simplicity of scheduling. | No particular disadvantage. |
| Multi Objective Scheduling [20] | The execution model assumes that all MapReduce jobs are independent, that there are no node failures before or during the scheduling computation, and that the scheduling decision is taken only on the basis of present knowledge. | It keeps performance high. | Execution time is too large. |
| Hybrid Multistage Heuristic Scheduling (HMHS) [21] | Uses Johnson's algorithm together with the Min-Min and Dynamic Min-Min algorithms. | Achieves not only a high data locality rate but also high cluster utilization. | It does not ensure reliability. |

VI. DISCUSSIONS

This paper provides a classification of Hadoop schedulers based on different parameters such as time, priority and resources. It discusses how various task scheduling algorithms help to achieve better results in a Hadoop cluster. Furthermore, the paper discusses the advantages and disadvantages of the various task scheduling algorithms. The comparison shows that each scheduling algorithm has some advantages and some disadvantages, so all of the algorithms are important in job scheduling.

VII. CONCLUSIONS

This paper gives an overall idea of the different job scheduling algorithms in big data and compares most of the properties of the various task scheduling algorithms. Individual scheduling techniques that are used to improve data locality, efficiency, makespan, fairness and performance are elaborated and discussed. However, scheduling remains an open area of research.