summary refs log tree commit diff
path: root/kernel
AgeCommit message (Collapse)Author
2011-01-05Merge commit 'v2.6.37' into sched/coreIngo Molnar
Merge reason: Merge the final .37 tree. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-01-04sched, autogroup: Fix reference leakMike Galbraith
The cgroup exit mess also uncovered a struct autogroup reference leak. copy_process() was simply freeing vs putting the signal_struct, stranding a reference. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Oleg Nesterov <oleg@redhat.com> LKML-Reference: <1293784350.6839.2.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-01-04sched, autogroup: Fix potential access to freed memoryMike Galbraith
Oleg pointed out that the /proc interface kref_get() useage may race with the final put during autogroup_move_group(). A signal->autogroup assignment may be in flight when the /proc interface dereference, leaving them taking a reference to an already dead group. Reported-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1292508592.5940.28.camel@maggy.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-01-04sched: Remove redundant CONFIG_CGROUP_SCHED ifdefYong Zhang
CONFIG_[FAIR|RT]_GROUP_SCHED always means CONFIG_CGROUP_SCHED Signed-off-by: Yong Zhang <yong.zhang0@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <1293803938-8157-1-git-send-email-yong.zhang0@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-01-03watchdog: Improve initialisation error message and documentationBen Hutchings
The error message 'NMI watchdog failed to create perf event...' does not make it clear that this is a fatal error for the watchdog. It also currently prints the error value as a pointer, rather than extracting the error code with PTR_ERR(). Fix that. Add a note to the description of the 'nowatchdog' kernel parameter to associate it with this message. Reported-by: Cesare Leonardi <celeonar@gmail.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: 599368@bugs.debian.org Cc: 608138@bugs.debian.org Cc: Don Zickus <dzickus@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: <stable@kernel.org> # .37.x and later LKML-Reference: <1294009362.3167.126.camel@localhost> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-12-29fix freeing user_struct in user cacheHillf Danton
When racing on adding into user cache, the new allocated from mm slab is freed without putting user namespace. Since the user namespace is already operated by getting, putting has to be issued. Signed-off-by: Hillf Danton <dhillf@gmail.com> Acked-by: Serge Hallyn <serge@hallyn.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-12-28Merge branch 'perf-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: ring_buffer: Off-by-one and duplicate events in ring_buffer_read_page
2010-12-24Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: percpu: print out alloc information with KERN_DEBUG instead of KERN_INFO kthread_work: make lockdep happy
2010-12-23ring_buffer: Off-by-one and duplicate events in ring_buffer_read_pageDavid Sharp
Fix two related problems in the event-copying loop of ring_buffer_read_page. The loop condition for copying events is off-by-one. "len" is the remaining space in the caller-supplied page. "size" is the size of the next event (or two events). If len == size, then there is just enough space for the next event. size was set to rb_event_ts_length, which may include the size of two events if the first event is a time-extend, in order to assure time- extends are kept together with the event after it. However, rb_advance_reader always advances by one event. This would result in the event after any time-extend being duplicated. Instead, get the size of a single event for the memcpy, but use rb_event_ts_length for the loop condition. Signed-off-by: David Sharp <dhsharp@google.com> LKML-Reference: <1293064704-8101-1-git-send-email-dhsharp@google.com> LKML-Reference: <AANLkTin7nLrRPc9qGjdjHbeVDDWiJjAiYyb-L=gH85bx@mail.gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-12-22taskstats: pad taskstats netlink response for aligment issues on ia64Jeff Mahoney
The taskstats structure is internally aligned on 8 byte boundaries but the layout of the aggregrate reply, with two NLA headers and the pid (each 4 bytes), actually force the entire structure to be unaligned. This causes the kernel to issue unaligned access warnings on some architectures like ia64. Unfortunately, some software out there doesn't properly unroll the NLA packet and assumes that the start of the taskstats structure will always be 20 bytes from the start of the netlink payload. Aligning the start of the taskstats structure breaks this software, which we don't want. So, for now the alignment only happens on architectures that require it and those users will have to update to fixed versions of those packages. Space is reserved in the packet only when needed. This ifdef should be removed in several years e.g. 2012 once we can be confident that fixed versions are installed on most systems. We add the padding before the aggregate since the aggregate is already a defined type. Commit 85893120 ("delayacct: align to 8 byte boundary on 64-bit systems") previously addressed the alignment issues by padding out the pid field. This was supposed to be a compatible change but the circumstances described above mean that it wasn't. This patch backs out that change, since it was a hack, and introduces a new NULL attribute type to provide the padding. Padding the response with 4 bytes avoids allocating an aligned taskstats structure and copying it back. Since the structure weighs in at 328 bytes, it's too big to do it on the stack. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reported-by: Brian Rogers <brian@xyzw.org> Cc: Jeff Mahoney <jeffm@suse.com> Cc: Guillaume Chazarain <guichaz@gmail.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-12-22kthread_work: make lockdep happyYong Zhang
spinlock in kthread_worker and wait_queue_head in kthread_work both should be lockdep sensible, so change the interface to make it suiltable for CONFIG_LOCKDEP. tj: comment update Reported-by: Nicolas <nicolas.mailhot@laposte.net> Signed-off-by: Yong Zhang <yong.zhang0@gmail.com> Signed-off-by: Andy Walls <awalls@md.metrocast.net> Tested-by: Andy Walls <awalls@md.metrocast.net> Cc: Tejun Heo <tj@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2010-12-20Merge branch 'sched-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Remove debugging check
2010-12-19sched: Remove debugging checkIngo Molnar
Linus reported that the new warning introduced by commit f26f9aff6aaf "Sched: fix skip_clock_update optimization" triggers. The need_resched flag can be set by other CPUs asynchronously so this debug check is bogus - remove it. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <AANLkTinJ8hAG1TpyC+CSYPR47p48+1=E7fiC45hMXT_1@mail.gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-12-19Merge branches 'x86-fixes-for-linus' and 'perf-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86-32: Make sure we can map all of lowmem if we need to x86, vt-d: Handle previous faults after enabling fault handling x86: Enable the intr-remap fault handling after local APIC setup x86, vt-d: Fix the vt-d fault handling irq migration in the x2apic mode x86, vt-d: Quirk for masking vtd spec errors to platform error handling logic x86, xsave: Use alloc_bootmem_align() instead of alloc_bootmem() bootmem: Add alloc_bootmem_align() x86, gcc-4.6: Use gcc -m options when building vdso x86: HPET: Chose a paranoid safe value for the ETIME check x86: io_apic: Avoid unused variable warning when CONFIG_GENERIC_PENDING_IRQ=n * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf: Fix off by one in perf_swevent_init() perf: Fix duplicate events with multiple-pmu vs software events ftrace: Have recordmcount honor endianness in fn_ELF_R_INFO scripts/tags.sh: Add magic for trace-events tracing: Fix panic when lseek() called on "trace" opened for writing
2010-12-19Merge branch 'sched-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Fix the irqtime code for 32bit sched: Fix the irqtime code to deal with u64 wraps nohz: Fix get_next_timer_interrupt() vs cpu hotplug Sched: fix skip_clock_update optimization sched: Cure more NO_HZ load average woes
2010-12-19sched: Fix interactivity bug by charging unaccounted run-time on entity ↵Paul Turner
re-weight Mike Galbraith reported poor interactivity[*] when the new shares distribution code was combined with autogroups. The root cause turns out to be a mis-ordering of accounting accrued execution time and shares updates. Since update_curr() is issued hierarchically, updating the parent entity weights to reflect child enqueue/dequeue results in the parent's unaccounted execution time then being accrued (vs vruntime) at the new weight as opposed to the weight present at accumulation. While this doesn't have much effect on processes with timeslices that cross a tick, it is particularly problematic for an interactive process (e.g. Xorg) which incurs many (tiny) timeslices. In this scenario almost all updates are at dequeue which can result in significant fairness perturbation (especially if it is the only thread, resulting in potential {tg->shares, MIN_SHARES} transitions). Correct this by ensuring unaccounted time is accumulated prior to manipulating an entity's weight. [*] http://xkcd.com/619/ is perversely Nostradamian here. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> LKML-Reference: <20101216031038.159704378@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-12-19sched: Move periodic share updates to entity_tick()Paul Turner
Long running entities that do not block (dequeue) require periodic updates to maintain accurate share values. (Note: group entities with several threads are quite likely to be non-blocking in many circumstances). By virtue of being long-running however, we will see entity ticks (otherwise the required update occurs in dequeue/put and we are done). Thus we can move the detection (and associated work) for these updates into the periodic path. This restores the 'atomicity' of update_curr() with respect to accounting. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20101216031038.067028969@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-12-19Merge commit 'v2.6.37-rc6' into sched/coreIngo Molnar
Merge reason: Update to the latest -rc. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-12-18Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6: x86: avoid high BIOS area when allocating address space x86: avoid E820 regions when allocating address space x86: avoid low BIOS area when allocating address space resources: add arch hook for preventing allocation in reserved areas Revert "resources: support allocating space within a region from the top down" Revert "PCI: allocate bus resources from the top down" Revert "x86/PCI: allocate space from the end of a region, not the beginning" Revert "x86: allocate space within a region top-down" Revert "PCI: fix pci_bus_alloc_resource() hang, prefer positive decode" PCI: Update MCP55 quirk to not affect non HyperTransport variants
2010-12-17resources: add arch hook for preventing allocation in reserved areasBjorn Helgaas
This adds arch_remove_reservations(), which an arch can implement if it needs to protect part of the address space from allocation. Sometimes that can be done by just putting a region in the resource tree, but there are cases where that doesn't work well. For example, x86 BIOS E820 reservations are not related to devices, so they may overlap part of, all of, or more than a device resource, so they may not end up at the correct spot in the resource tree. Acked-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
2010-12-17Revert "resources: support allocating space within a region from the top down"Bjorn Helgaas
This reverts commit e7f8567db9a7f6b3151b0b275e245c1cef0d9c70. Acked-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
2010-12-16PM / Hibernate: Restore old swap signature to avoid user space breakageRafael J. Wysocki
Commit 3624eb0 (PM / Hibernate: Modify signature used to mark swap) attempted to modify hibernate signature used to mark swap partitions containing hibernation images, so that old kernels don't try to handle compressed images. However, this change broke resume from hibernation on Fedora 14 that apparently doesn't pass the resume= argument to the kernel and tries to trigger resume from early user space. This doesn't work, because the signature is now different, so the old signature has to be restored to avoid the problem. Addresses https://bugzilla.kernel.org/show_bug.cgi?id=22732 . Reported-by: Dr. David Alan Gilbert <linux@treblig.org> Reported-by: Zhang Rui <rui.zhang@intel.com> Reported-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2010-12-16PM / Hibernate: Fix PM_POST_* notification with user-space suspendTakashi Iwai
The user-space hibernation sends a wrong notification after the image restoration because of thinko for the file flag check. RDONLY corresponds to hibernation and WRONLY to restoration, confusingly. Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: stable@kernel.org
2010-12-16Merge branch 'tip/perf/urgent' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into perf/urgent
2010-12-16sched: Fix the irqtime code for 32bitPeter Zijlstra
Since the irqtime accounting is using non-atomic u64 and can be read from remote cpus (writes are strictly cpu local, reads are not) we have to deal with observing partial updates. When we do observe partial updates the clock movement (in particular, ->clock_task movement) will go funny (in either direction), a subsequent clock update (observing the full update) will make it go funny in the oposite direction. Since we rely on these clocks to be strictly monotonic we cannot suffer backwards motion. One possible solution would be to simply ignore all backwards deltas, but that will lead to accounting artefacts, most notable: clock_task + irq_time != clock, this inaccuracy would end up in user visible stats. Therefore serialize the reads using a seqcount. Reviewed-by: Venkatesh Pallipadi <venki@google.com> Reported-by: Mikael Pettersson <mikpe@it.uu.se> Tested-by: Mikael Pettersson <mikpe@it.uu.se> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1292242434.6803.200.camel@twins> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-12-16sched: Fix the irqtime code to deal with u64 wrapsPeter Zijlstra
Some ARM systems have a short sched_clock() [ which needs to be fixed too ], but this exposed a bug in the irq_time code as well, it doesn't deal with wraps at all. Fix the irq_time code to deal with u64 wraps by re-writing the code to only use delta increments, which avoids the whole issue. Reviewed-by: Venkatesh Pallipadi <venki@google.com> Reported-by: Mikael Pettersson <mikpe@it.uu.se> Tested-by: Mikael Pettersson <mikpe@it.uu.se> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1292242433.6803.199.camel@twins> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-12-16perf: Fix off by one in perf_swevent_init()Dan Carpenter
The perf_swevent_enabled[] array has PERF_COUNT_SW_MAX elements. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20101024195041.GT5985@bicker> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-12-14Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: It is likely that WORKER_NOT_RUNNING is true MAINTAINERS: Add workqueue entry workqueue: check the allocation of system_unbound_wq
2010-12-14workqueue: It is likely that WORKER_NOT_RUNNING is trueSteven Rostedt
Running the annotate branch profiler on three boxes, including my main box that runs firefox, evolution, xchat, and is part of the distcc farm, showed this with the likelys in the workqueue code: correct incorrect % Function File Line ------- --------- - -------- ---- ---- 96 996253 99 wq_worker_sleeping workqueue.c 703 96 996247 99 wq_worker_waking_up workqueue.c 677 The likely()s in this case were assuming that WORKER_NOT_RUNNING will most likely be false. But this is not the case. The reason is (and shown by adding trace_printks and testing it) that most of the time WORKER_PREP is set. In worker_thread() we have: worker_clr_flags(worker, WORKER_PREP); [ do work stuff ] worker_set_flags(worker, WORKER_PREP, false); (that 'false' means not to wake up an idle worker) The wq_worker_sleeping() is called from schedule when a worker thread is putting itself to sleep. Which happens most of the time outside of that [ do work stuff ]. The wq_worker_waking_up is called by the wakeup worker code, which is also callod outside that [ do work stuff ]. Thus, the likely and unlikely used by those two functions are actually backwards. Remove the annotation and let gcc figure it out. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2010-12-08printk: Use this_cpu_{read|write} api on printk_pendingEric Dumazet
__get_cpu_var() is a bit inefficient, lets use __this_cpu_read() and __this_cpu_write() to manipulate printk_pending. printk_needs_cpu(cpu) is called only for the current cpu : Use faster __this_cpu_read(). Remove the redundant unlikely on (cpu_is_offline(cpu)) test: # size kernel/printk.o* text data bss dec hex filename 9942 756 263488 274186 42f0a kernel/printk.o.new 9990 756 263488 274234 42f3a kernel/printk.o.old Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1290788536.2855.237.camel@edumazet-laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-12-08sched: Make pushable_tasks CONFIG_SMP dependantDario Faggioli
As noted by Peter Zijlstra at https://lkml.org/lkml/2010/11/10/391 (while reviewing other stuff, though), tracking pushable tasks only makes sense on SMP systems. Signed-off-by: Dario Faggioli <raistlin@linux.it> Acked-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Gregory Haskins <ghaskins@novell.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1291143093.2697.298.camel@Palantir> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-12-08Merge branch 'linus' into sched/coreIngo Molnar
Merge reason: we want to queue up dependent cleanup Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-12-08nohz: Fix get_next_timer_interrupt() vs cpu hotplugHeiko Carstens
This fixes a bug as seen on 2.6.32 based kernels where timers got enqueued on offline cpus. If a cpu goes offline it might still have pending timers. These will be migrated during CPU_DEAD handling after the cpu is offline. However while the cpu is going offline it will schedule the idle task which will then call tick_nohz_stop_sched_tick(). That function in turn will call get_next_timer_intterupt() to figure out if the tick of the cpu can be stopped or not. If it turns out that the next tick is just one jiffy off (delta_jiffies == 1) tick_nohz_stop_sched_tick() incorrectly assumes that the tick should not stop and takes an early exit and thus it won't update the load balancer cpu. Just afterwards the cpu will be killed and the load balancer cpu could be the offline cpu. On 2.6.32 based kernel get_nohz_load_balancer() gets called to decide on which cpu a timer should be enqueued (see __mod_timer()). Which leads to the possibility that timers get enqueued on an offline cpu. These will never expire and can cause a system hang. This has been observed 2.6.32 kernels. On current kernels __mod_timer() uses get_nohz_timer_target() which doesn't have that problem. However there might be other problems because of the too early exit tick_nohz_stop_sched_tick() in case a cpu goes offline. The easiest and probably safest fix seems to be to let get_next_timer_interrupt() just lie and let it say there isn't any pending timer if the current cpu is offline. I also thought of moving migrate_[hr]timers() from CPU_DEAD to CPU_DYING, but seeing that there already have been fixes at least in the hrtimer code in this area I'm afraid that this could add new subtle bugs. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20101201091109.GA8984@osiris.boeblingen.de.ibm.com> Cc: stable@kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-12-08Sched: fix skip_clock_update optimizationMike Galbraith
idle_balance() drops/retakes rq->lock, leaving the previous task vulnerable to set_tsk_need_resched(). Clear it after we return from balancing instead, and in setup_thread_stack() as well, so no successfully descheduled or never scheduled task has it set. Need resched confused the skip_clock_update logic, which assumes that the next call to update_rq_clock() will come nearly immediately after being set. Make the optimization robust against the waking a sleeper before it sucessfully deschedules case by checking that the current task has not been dequeued before setting the flag, since it is that useless clock update we're trying to save, and clear unconditionally in schedule() proper instead of conditionally in put_prev_task(). Signed-off-by: Mike Galbraith <efault@gmx.de> Reported-by: Bjoern B. Brandenburg <bbb.lst@gmail.com> Tested-by: Yong Zhang <yong.zhang0@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: stable@kernel.org LKML-Reference: <1291802742.1417.9.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-12-08sched: Cure more NO_HZ load average woesPeter Zijlstra
There's a long-running regression that proved difficult to fix and which is hitting certain people and is rather annoying in its effects. Damien reported that after 74f5187ac8 (sched: Cure load average vs NO_HZ woes) his load average is unnaturally high, he also noted that even with that patch reverted the load avgerage numbers are not correct. The problem is that the previous patch only solved half the NO_HZ problem, it addressed the part of going into NO_HZ mode, not of comming out of NO_HZ mode. This patch implements that missing half. When comming out of NO_HZ mode there are two important things to take care of: - Folding the pending idle delta into the global active count. - Correctly aging the averages for the idle-duration. So with this patch the NO_HZ interaction should be complete and behaviour between CONFIG_NO_HZ=[yn] should be equivalent. Furthermore, this patch slightly changes the load average computation by adding a rounding term to the fixed point multiplication. Reported-by: Damien Wyart <damien.wyart@free.fr> Reported-by: Tim McGrath <tmhikaru@gmail.com> Tested-by: Damien Wyart <damien.wyart@free.fr> Tested-by: Orion Poplawski <orion@cora.nwra.com> Tested-by: Kyle McMartin <kyle@mcmartin.ca> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: stable@kernel.org Cc: Chase Douglas <chase.douglas@canonical.com> LKML-Reference: <1291129145.32004.874.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-12-08perf: Fix duplicate events with multiple-pmu vs software eventsPeter Zijlstra
Because the multi-pmu bits can share contexts between struct pmu instances we could get duplicate events by iterating the pmu list. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-12-08Merge branches 'x86-fixes-for-linus', 'perf-fixes-for-linus' and ↵Linus Torvalds
'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86/pvclock: Zero last_value on resume * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf record: Fix eternal wait for stillborn child perf header: Don't assume there's no attr info if no sample ids is provided perf symbols: Figure out start address of kernel map from kallsyms perf symbols: Fix kallsyms kernel/module map splitting * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: nohz: Fix printk_needs_cpu() return value on offline cpus printk: Fix wake_up_klogd() vs cpu hotplug
2010-12-07Merge branch 'irq-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: genirq: Fix incorrect proc spurious output
2010-12-06Merge branch 'pm-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6 * 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6: PM / Hibernate: Fix memory corruption related to swap PM / Hibernate: Use async I/O when reading compressed hibernation image
2010-12-06PM / Hibernate: Fix memory corruption related to swapRafael J. Wysocki
There is a problem that swap pages allocated before the creation of a hibernation image can be released and used for storing the contents of different memory pages while the image is being saved. Since the kernel stored in the image doesn't know of that, it causes memory corruption to occur after resume from hibernation, especially on systems with relatively small RAM that need to swap often. This issue can be addressed by keeping the GFP_IOFS bits clear in gfp_allowed_mask during the entire hibernation, including the saving of the image, until the system is finally turned off or the hibernation is aborted. Unfortunately, for this purpose it's necessary to rework the way in which the hibernate and suspend code manipulates gfp_allowed_mask. This change is based on an earlier patch from Hugh Dickins. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Reported-by: Ondrej Zary <linux@rainbow-software.org> Acked-by: Hugh Dickins <hughd@google.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: stable@kernel.org
2010-12-06PM / Hibernate: Use async I/O when reading compressed hibernation imageBojan Smojver
This is a fix for reading LZO compressed image using async I/O. Essentially, instead of having just one page into which we keep reading blocks from swap, we allocate enough of them to cover the largest compressed size and then let block I/O pick them all up. Once we have them all (and here we wait), we decompress them, as usual. Obviously, the very first block we still pick up synchronously, because we need to know the size of the lot before we pick up the rest. Also fixed the copyright line, which I've forgotten before. Signed-off-by: Bojan Smojver <bojan@rexursive.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2010-12-02do_exit(): make sure that we run with get_fs() == USER_DSNelson Elhage
If a user manages to trigger an oops with fs set to KERNEL_DS, fs is not otherwise reset before do_exit(). do_exit may later (via mm_release in fork.c) do a put_user to a user-controlled address, potentially allowing a user to leverage an oops into a controlled write into kernel memory. This is only triggerable in the presence of another bug, but this potentially turns a lot of DoS bugs into privilege escalations, so it's worth fixing. I have proof-of-concept code which uses this bug along with CVE-2010-3849 to write a zero to an arbitrary kernel address, so I've tested that this is not theoretical. A more logical place to put this fix might be when we know an oops has occurred, before we call do_exit(), but that would involve changing every architecture, in multiple places. Let's just stick it in do_exit instead. [akpm@linux-foundation.org: update code comment] Signed-off-by: Nelson Elhage <nelhage@ksplice.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-12-01genirq: Fix incorrect proc spurious outputKenji Kaneshige
Since commit a1afb637(switch /proc/irq/*/spurious to seq_file) all /proc/irq/XX/spurious files show the information of irq 0. Current irq_spurious_proc_open() passes on NULL as the 3rd argument, which is used as an IRQ number in irq_spurious_proc_show(), to the single_open(). Because of this, all the /proc/irq/XX/spurious file shows IRQ 0 information regardless of the IRQ number. To fix the problem, irq_spurious_proc_open() must pass on the appropreate data (IRQ number) to single_open(). Signed-off-by: Kenji Kaneshige <kaneshige.kenji@jp.fujitsu.com> Reviewed-by: Yong Zhang <yong.zhang0@gmail.com> LKML-Reference: <4CF4B778.90604@jp.fujitsu.com> Cc: stable@kernel.org [2.6.33+] Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-11-30tracing: Fix panic when lseek() called on "trace" opened for writingSlava Pestov
The file_ops struct for the "trace" special file defined llseek as seq_lseek(). However, if the file was opened for writing only, seq_open() was not called, and the seek would dereference a null pointer, file->private_data. This patch introduces a new wrapper for seq_lseek() which checks if the file descriptor is opened for reading first. If not, it does nothing. Cc: <stable@kernel.org> Signed-off-by: Slava Pestov <slavapestov@google.com> LKML-Reference: <1290640396-24179-1-git-send-email-slavapestov@google.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-11-30sched: Add 'autogroup' scheduling feature: automated per session task groupsMike Galbraith
A recurring complaint from CFS users is that parallel kbuild has a negative impact on desktop interactivity. This patch implements an idea from Linus, to automatically create task groups. Currently, only per session autogroups are implemented, but the patch leaves the way open for enhancement. Implementation: each task's signal struct contains an inherited pointer to a refcounted autogroup struct containing a task group pointer, the default for all tasks pointing to the init_task_group. When a task calls setsid(), a new task group is created, the process is moved into the new task group, and a reference to the preveious task group is dropped. Child processes inherit this task group thereafter, and increase it's refcount. When the last thread of a process exits, the process's reference is dropped, such that when the last process referencing an autogroup exits, the autogroup is destroyed. At runqueue selection time, IFF a task has no cgroup assignment, its current autogroup is used. Autogroup bandwidth is controllable via setting it's nice level through the proc filesystem: cat /proc/<pid>/autogroup Displays the task's group and the group's nice level. echo <nice level> > /proc/<pid>/autogroup Sets the task group's shares to the weight of nice <level> task. Setting nice level is rate limited for !admin users due to the abuse risk of task group locking. The feature is enabled from boot by default if CONFIG_SCHED_AUTOGROUP=y is selected, but can be disabled via the boot option noautogroup, and can also be turned on/off on the fly via: echo [01] > /proc/sys/kernel/sched_autogroup_enabled ... which will automatically move tasks to/from the root task group. Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Markus Trippelsdorf <markus@trippelsdorf.de> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Paul Turner <pjt@google.com> Cc: Oleg Nesterov <oleg@redhat.com> [ Removed the task_group_path() debug code, and fixed !EVENTFD build failure. ] Signed-off-by: Ingo Molnar <mingo@elte.hu> LKML-Reference: <1290281700.28711.9.camel@maggy.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-30sched: Fix unregister_fair_sched_group()Paul Turner
In the flipping and flopping between calling unregister_fair_sched_group() on a per-cpu versus per-group basis we ended up in a bad state. Remove from the list for the passed cpu as opposed to some arbitrary index. ( This fixes explosions w/ autogroup as well as a group creation/destruction stress test. ) Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Paul Turner <pjt@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <20101130005740.080828123@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-28Merge branch 'perf-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf: Fix the software context switch counter perf, x86: Fixup Kconfig deps x86, perf, nmi: Disable perf if counters are not accessible perf: Fix inherit vs. context rotation bug
2010-11-27Merge branch 'timers-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: posix-cpu-timers: Rcu_read_lock/unlock protect find_task_by_vpid call
2010-11-27Merge branch 'perf-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf symbols: Remove incorrect open-coded container_of() perf record: Handle restrictive permissions in /proc/{kallsyms,modules} x86/kprobes: Prevent kprobes to probe on save_args() irq_work: Drop cmpxchg() result perf: Fix owner-list vs exit x86, hw_nmi: Move backtrace_mask declaration under ARCH_HAS_NMI_WATCHDOG tracing: Fix recursive user stack trace perf,hw_breakpoint: Initialize hardware api earlier x86: Ignore trap bits on single step exceptions tracing: Force arch_local_irq_* notrace for paravirt tracing: Fix module use of trace_bprintk()
2010-11-27Merge branch 'sched-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Fix idle balancing sched: Fix volanomark performance regression