summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2008-11-17Merge branches 'tracing/branch-tracer', 'tracing/ftrace', ↵Ingo Molnar
'tracing/function-return-tracer', 'tracing/tracepoints' and 'tracing/urgent' into tracing/core
2008-11-16stop_machine: fix race with return value (fixes Bug #11989)Rusty Russell
Bug #11989: Suspend failure on NForce4-based boards due to chanes in stop_machine We should not access active.fnret outside the lock; in theory the next stop_machine could overwrite it. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Tested-by: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-16markers/tracpoints: fix non-modular buildIngo Molnar
fix: kernel/marker.c: In function 'marker_module_notify': kernel/marker.c:905: error: 'MODULE_STATE_COMING' undeclared (first use in this function) [...] Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-16tracepoints: add DECLARE_TRACE() and DEFINE_TRACE()Mathieu Desnoyers
Impact: API *CHANGE*. Must update all tracepoint users. Add DEFINE_TRACE() to tracepoints to let them declare the tracepoint structure in a single spot for all the kernel. It helps reducing memory consumption, especially when declaring a lot of tracepoints, e.g. for kmalloc tracing. *API CHANGE WARNING*: now, DECLARE_TRACE() must be used in headers for tracepoint declarations rather than DEFINE_TRACE(). This is the sane way to do it. The name previously used was misleading. Updates scheduler instrumentation to follow this API change. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-16tracepoints: use modules notifiersMathieu Desnoyers
Impact: cleanup Use module notifiers for tracepoint updates rather than adding a hook in module.c. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-16tracepoints: fix disableMathieu Desnoyers
Impact: fix race Set the probe array pointer to NULL when the tracepoint is disabled. The probe array point not being NULL could generate a race condition where the reader would dereference a freed pointer. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-16markers: auto enable tracepoints (new API : trace_mark_tp())Mathieu Desnoyers
Impact: new API Add a new API trace_mark_tp(), which declares a marker within a tracepoint probe. When the marker is activated, the tracepoint is automatically enabled. No branch test is used at the marker site, because it would be a duplicate of the branch already present in the tracepoint. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-16markers: use module notifierMathieu Desnoyers
Impact: cleanup Use module notifiers instead of adding a hook in module.c. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-16markers: use rcu_*_sched_notrace and notraceMathieu Desnoyers
Make marker critical code use notrace to make sure they can be used as an ftrace callback. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-16markers: fix unregisterMathieu Desnoyers
Impact: fix marker registers/unregister race get_marker() can return a NULL entry because the mutex is released in the middle of those functions. Make sure we check to see if it has been concurrently removed. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-16function tracing: fix wrong pos computing when read buffer has been fulfilledwalimis
Impact: make output of available_filter_functions complete phenomenon: The first value of dyn_ftrace_total_info is not equal with `cat available_filter_functions | wc -l`, but they should be equal. root cause: When printing functions with seq_printf in t_show, if the read buffer is just overflowed by current function record, then this function won't be printed to user space through read buffer, it will just be dropped. So we can't see this function printing. So, every time the last function to fill the read buffer, if overflowed, will be dropped. This also applies to set_ftrace_filter if set_ftrace_filter has more bytes than read buffer. fix: Through checking return value of seq_printf, if less than 0, we know this function doesn't be printed. Then we decrease position to force this function to be printed next time, in next read buffer. Another little fix is to show correct allocating pages count. Signed-off-by: walimis <walimisdev@gmail.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-16sched: fix kernel warning on /proc/sched_debug accessIngo Molnar
Luis Henriques reported that with CONFIG_PREEMPT=y + CONFIG_PREEMPT_DEBUG=y + CONFIG_SCHED_DEBUG=y + CONFIG_LATENCYTOP=y enabled, the following warning triggers when using latencytop: > [ 775.663239] BUG: using smp_processor_id() in preemptible [00000000] code: latencytop/6585 > [ 775.663303] caller is native_sched_clock+0x3a/0x80 > [ 775.663314] Pid: 6585, comm: latencytop Tainted: G W 2.6.28-rc4-00355-g9c7c354 #1 > [ 775.663322] Call Trace: > [ 775.663343] [<ffffffff803a94e4>] debug_smp_processor_id+0xe4/0xf0 > [ 775.663356] [<ffffffff80213f7a>] native_sched_clock+0x3a/0x80 > [ 775.663368] [<ffffffff80213e19>] sched_clock+0x9/0x10 > [ 775.663381] [<ffffffff8024550d>] proc_sched_show_task+0x8bd/0x10e0 > [ 775.663395] [<ffffffff8034466e>] sched_show+0x3e/0x80 > [ 775.663408] [<ffffffff8031039b>] seq_read+0xdb/0x350 > [ 775.663421] [<ffffffff80368776>] ? security_file_permission+0x16/0x20 > [ 775.663435] [<ffffffff802f4198>] vfs_read+0xc8/0x170 > [ 775.663447] [<ffffffff802f4335>] sys_read+0x55/0x90 > [ 775.663460] [<ffffffff8020c67a>] system_call_fastpath+0x16/0x1b > ... This breakage was caused by me via: 7cbaef9: sched: optimize sched_clock() a bit Change the calls to cpu_clock(). Reported-by: Luis Henriques <henrix@sapo.pt>
2008-11-16tracing/function-return-tracer: support for dynamic ftrace on function ↵Frederic Weisbecker
return tracer This patch adds the support for dynamic tracing on the function return tracer. The whole difference with normal dynamic function tracing is that we don't need to hook on a particular callback. The only pro that we want is to nop or set dynamically the calls to ftrace_caller (which is ftrace_return_caller here). Some security checks ensure that we are not trying to launch dynamic tracing for return tracing while normal function tracing is already running. An example of trace with getnstimeofday set as a filter: ktime_get_ts+0x22/0x50 -> getnstimeofday (2283 ns) ktime_get_ts+0x22/0x50 -> getnstimeofday (1396 ns) ktime_get_ts+0x22/0x50 -> getnstimeofday (1382 ns) ktime_get_ts+0x22/0x50 -> getnstimeofday (1825 ns) ktime_get_ts+0x22/0x50 -> getnstimeofday (1426 ns) ktime_get_ts+0x22/0x50 -> getnstimeofday (1464 ns) ktime_get_ts+0x22/0x50 -> getnstimeofday (1524 ns) ktime_get_ts+0x22/0x50 -> getnstimeofday (1382 ns) ktime_get_ts+0x22/0x50 -> getnstimeofday (1382 ns) ktime_get_ts+0x22/0x50 -> getnstimeofday (1434 ns) ktime_get_ts+0x22/0x50 -> getnstimeofday (1464 ns) ktime_get_ts+0x22/0x50 -> getnstimeofday (1502 ns) ktime_get_ts+0x22/0x50 -> getnstimeofday (1404 ns) ktime_get_ts+0x22/0x50 -> getnstimeofday (1397 ns) ktime_get_ts+0x22/0x50 -> getnstimeofday (1051 ns) ktime_get_ts+0x22/0x50 -> getnstimeofday (1314 ns) ktime_get_ts+0x22/0x50 -> getnstimeofday (1344 ns) ktime_get_ts+0x22/0x50 -> getnstimeofday (1163 ns) ktime_get_ts+0x22/0x50 -> getnstimeofday (1390 ns) ktime_get_ts+0x22/0x50 -> getnstimeofday (1374 ns) Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-16tracing/branch-tracer: fix a trace recursion on branch tracerFrederic Weisbecker
Impact: fix crash when enabling the branch-tracer When the branch tracer inserts an event through probe_likely_condition(), it calls local_irq_save() and then results in a trace recursion. local_irq_save() -> trace_hardirqs_off() -> trace_hardirqs_off_caller() -> unlikely() The trace_branch.c file is protected by DISABLE_BRANCH_PROFILING but that doesn't prevent from external call to functions that use unlikely(). My box crashed each time I tried to set this tracer (sudden and hard reboot). Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-16tracing/ftrace: change the type of the init() callbackFrederic Weisbecker
Impact: extend the ->init() method with the ability to fail This bring a way to know if the initialization of a tracer successed. A tracer must return 0 on success and a traditional error (ie: -ENOMEM) if it fails. If a tracer fails to init, it is free to print a detailed warn. The tracing api will not and switch to a new tracer will just return the error from the init callback. Note: this will be used for the return tracer. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-16tracing/ftrace: fix unexpected -EINVAL when longest tracer name is setFrederic Weisbecker
Impact: fix confusing write() -EINVAL when changing the tracer The following commit d9e540762f5cdd89f24e518ad1fd31142d0b9726 remade alive the bug which made the set of a new tracer returning -EINVAL if this is the longest name of tracer. This patch corrects it. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-16ftrace: make filtered functions effective on settingSteven Rostedt
Impact: set filtered functions at time the filter is set It can be confusing when the set_filter_functions is set (or cleared) and the functions being recorded by the dynamic tracer does not match. This patch causes the code to be updated if the function tracer is enabled and the filter is changed. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-16ftrace: fix dyn ftrace filterSteven Rostedt
Impact: correct implementation of dyn ftrace filter The old decisions made by the filter algorithm was complex and incorrect. This lead to inconsistent enabling or disabling of functions when the filter was used. This patch simplifies that code and in doing so, corrects the usage of the filters. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-16ftrace: allow NULL pointers in mcount_locSteven Rostedt
Impact: make ftrace_convert_nops() more permissive Due to the way different architecture linkers combine the data sections of the mcount_loc (the section that lists all the locations that call mcount), there may be zeros added in that section. This is usually due to strange alignments that the linker performs, that pads in zeros. This patch makes the conversion code to nops skip any pointer in the mcount_loc section that is NULL. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-16ftrace: pass module struct to arch dynamic ftrace functionsSteven Rostedt
Impact: allow archs more flexibility on dynamic ftrace implementations Dynamic ftrace has largly been developed on x86. Since x86 does not have the same limitations as other architectures, the ftrace interaction between the generic code and the architecture specific code was not flexible enough to handle some of the issues that other architectures have. Most notably, module trampolines. Due to the limited branch distance that archs make in calling kernel core code from modules, the module load code must create a trampoline to jump to what will make the larger jump into core kernel code. The problem arises when this happens to a call to mcount. Ftrace checks all code before modifying it and makes sure the current code is what it expects. Right now, there is not enough information to handle modifying module trampolines. This patch changes the API between generic dynamic ftrace code and the arch dependent code. There is now two functions for modifying code: ftrace_make_nop(mod, rec, addr) - convert the code at rec->ip into a nop, where the original text is calling addr. (mod is the module struct if called by module init) ftrace_make_caller(rec, addr) - convert the code rec->ip that should be a nop into a caller to addr. The record "rec" now has a new field called "arch" where the architecture can add any special attributes to each call site record. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-16ftrace: replace raw_local_irq_save with local_irq_saveSteven Rostedt
Impact: fix lockdep disabling itself when function tracing is enabled The raw_local_irq_saves used in ftrace is causing problems with lockdep. (it thinks the irq flags are out of sync and disables itself with a warning) The raw ops here are not needed, and the normal local_irq_save is fine. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-16ftrace: do not process freed recordsSteven Rostedt
Impact: keep from converting freed records When the tracer is started or stopped, it converts all code pointed to by the saved records into callers to ftrace or nops. When modules are unloaded, their records are freed, but they still exist within the record pages. This patch changes the code to skip over freed records. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-16ftrace: disable ftrace on anomalies in trace start and stopSteven Rostedt
Impact: robust feature to disable ftrace on start or stop tracing on error Currently only the initial conversion to nops will disable ftrace on an anomaly. But if an anomaly happens on start or stopping of the tracer, it will silently fail. This patch adds a check there too, to disable ftrace and warn if the conversion fails. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-16ftrace: remove condition from ftrace_record_ipSteven Rostedt
Impact: let module functions be recorded when dyn ftrace not enabled When dynamic ftrace had a daemon and a hash to record the locations of mcount callers at run time, the recording needed to stop when ftrace was disabled. But now that the recording is done at compile time and the ftrace_record_ip is only called at boot up and when a module is loaded, we no longer need to check if ftrace_enabled is set. In fact, this breaks module load if it is not set because we skip over module functions. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-16Merge branches 'tracing/fastboot', 'tracing/ftrace' and 'tracing/urgent' ↵Ingo Molnar
into tracing/core
2008-11-15Fix inotify watch removal/umount racesAl Viro
Inotify watch removals suck violently. To kick the watch out we need (in this order) inode->inotify_mutex and ih->mutex. That's fine if we have a hold on inode; however, for all other cases we need to make damn sure we don't race with umount. We can *NOT* just grab a reference to a watch - inotify_unmount_inodes() will happily sail past it and we'll end with reference to inode potentially outliving its superblock. Ideally we just want to grab an active reference to superblock if we can; that will make sure we won't go into inotify_umount_inodes() until we are done. Cleanup is just deactivate_super(). However, that leaves a messy case - what if we *are* racing with umount() and active references to superblock can't be acquired anymore? We can bump ->s_count, grab ->s_umount, which will almost certainly wait until the superblock is shut down and the watch in question is pining for fjords. That's fine, but there is a problem - we might have hit the window between ->s_active getting to 0 / ->s_count - below S_BIAS (i.e. the moment when superblock is past the point of no return and is heading for shutdown) and the moment when deactivate_super() acquires ->s_umount. We could just do drop_super() yield() and retry, but that's rather antisocial and this stuff is luser-triggerable. OTOH, having grabbed ->s_umount and having found that we'd got there first (i.e. that ->s_root is non-NULL) we know that we won't race with inotify_umount_inodes(). So we could grab a reference to watch and do the rest as above, just with drop_super() instead of deactivate_super(), right? Wrong. We had to drop ih->mutex before we could grab ->s_umount. So the watch could've been gone already. That still can be dealt with - we need to save watch->wd, do idr_find() and compare its result with our pointer. If they match, we either have the damn thing still alive or we'd lost not one but two races at once, the watch had been killed and a new one got created with the same ->wd at the same address. That couldn't have happened in inotify_destroy(), but inotify_rm_wd() could run into that. Still, "new one got created" is not a problem - we have every right to kill it or leave it alone, whatever's more convenient. So we can use idr_find(...) == watch && watch->inode->i_sb == sb as "grab it and kill it" check. If it's been our original watch, we are fine, if it's a newcomer - nevermind, just pretend that we'd won the race and kill the fscker anyway; we are safe since we know that its superblock won't be going away. And yes, this is far beyond mere "not very pretty"; so's the entire concept of inotify to start with. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Greg KH <greg@kroah.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-15Move "exit_robust_list" into mm_release()Linus Torvalds
We don't want to get rid of the futexes just at exit() time, we want to drop them when doing an execve() too, since that gets rid of the previous VM image too. Doing it at mm_release() time means that we automatically always do it when we disassociate a VM map from the task. Reported-by: pageexec@freemail.hu Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Brad Spengler <spender@grsecurity.net> Cc: Alex Efros <powerman@powerman.name> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-14Merge branch 'master' into nextJames Morris
Conflicts: security/keys/internal.h security/keys/process_keys.c security/keys/request_key.c Fixed conflicts above by using the non 'tsk' versions. Signed-off-by: James Morris <jmorris@namei.org>
2008-11-14CRED: Allow kernel services to override LSM settings for task actionsDavid Howells
Allow kernel services to override LSM settings appropriate to the actions performed by a task by duplicating a set of credentials, modifying it and then using task_struct::cred to point to it when performing operations on behalf of a task. This is used, for example, by CacheFiles which has to transparently access the cache on behalf of a process that thinks it is doing, say, NFS accesses with a potentially inappropriate (with respect to accessing the cache) set of credentials. This patch provides two LSM hooks for modifying a task security record: (*) security_kernel_act_as() which allows modification of the security datum with which a task acts on other objects (most notably files). (*) security_kernel_create_files_as() which allows modification of the security datum that is used to initialise the security data on a file that a task creates. The patch also provides four new credentials handling functions, which wrap the LSM functions: (1) prepare_kernel_cred() Prepare a set of credentials for a kernel service to use, based either on a daemon's credentials or on init_cred. All the keyrings are cleared. (2) set_security_override() Set the LSM security ID in a set of credentials to a specific security context, assuming permission from the LSM policy. (3) set_security_override_from_ctx() As (2), but takes the security context as a string. (4) set_create_files_as() Set the file creation LSM security ID in a set of credentials to be the same as that on a particular inode. Signed-off-by: Casey Schaufler <casey@schaufler-ca.com> [Smack changes] Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>
2008-11-14CRED: Differentiate objective and effective subjective credentials on a taskDavid Howells
Differentiate the objective and real subjective credentials from the effective subjective credentials on a task by introducing a second credentials pointer into the task_struct. task_struct::real_cred then refers to the objective and apparent real subjective credentials of a task, as perceived by the other tasks in the system. task_struct::cred then refers to the effective subjective credentials of a task, as used by that task when it's actually running. These are not visible to the other tasks in the system. __task_cred(task) then refers to the objective/real credentials of the task in question. current_cred() refers to the effective subjective credentials of the current task. prepare_creds() uses the objective creds as a base and commit_creds() changes both pointers in the task_struct (indeed commit_creds() requires them to be the same). override_creds() and revert_creds() change the subjective creds pointer only, and the former returns the old subjective creds. These are used by NFSD, faccessat() and do_coredump(), and will by used by CacheFiles. In SELinux, current_has_perm() is provided as an alternative to task_has_perm(). This uses the effective subjective context of current, whereas task_has_perm() uses the objective/real context of the subject. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>
2008-11-14CRED: DocumentationDavid Howells
Document credentials and the new credentials API. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>
2008-11-14CRED: Make execve() take advantage of copy-on-write credentialsDavid Howells
Make execve() take advantage of copy-on-write credentials, allowing it to set up the credentials in advance, and then commit the whole lot after the point of no return. This patch and the preceding patches have been tested with the LTP SELinux testsuite. This patch makes several logical sets of alteration: (1) execve(). The credential bits from struct linux_binprm are, for the most part, replaced with a single credentials pointer (bprm->cred). This means that all the creds can be calculated in advance and then applied at the point of no return with no possibility of failure. I would like to replace bprm->cap_effective with: cap_isclear(bprm->cap_effective) but this seems impossible due to special behaviour for processes of pid 1 (they always retain their parent's capability masks where normally they'd be changed - see cap_bprm_set_creds()). The following sequence of events now happens: (a) At the start of do_execve, the current task's cred_exec_mutex is locked to prevent PTRACE_ATTACH from obsoleting the calculation of creds that we make. (a) prepare_exec_creds() is then called to make a copy of the current task's credentials and prepare it. This copy is then assigned to bprm->cred. This renders security_bprm_alloc() and security_bprm_free() unnecessary, and so they've been removed. (b) The determination of unsafe execution is now performed immediately after (a) rather than later on in the code. The result is stored in bprm->unsafe for future reference. (c) prepare_binprm() is called, possibly multiple times. (i) This applies the result of set[ug]id binaries to the new creds attached to bprm->cred. Personality bit clearance is recorded, but now deferred on the basis that the exec procedure may yet fail. (ii) This then calls the new security_bprm_set_creds(). This should calculate the new LSM and capability credentials into *bprm->cred. This folds together security_bprm_set() and parts of security_bprm_apply_creds() (these two have been removed). Anything that might fail must be done at this point. (iii) bprm->cred_prepared is set to 1. bprm->cred_prepared is 0 on the first pass of the security calculations, and 1 on all subsequent passes. This allows SELinux in (ii) to base its calculations only on the initial script and not on the interpreter. (d) flush_old_exec() is called to commit the task to execution. This performs the following steps with regard to credentials: (i) Clear pdeath_signal and set dumpable on certain circumstances that may not be covered by commit_creds(). (ii) Clear any bits in current->personality that were deferred from (c.i). (e) install_exec_creds() [compute_creds() as was] is called to install the new credentials. This performs the following steps with regard to credentials: (i) Calls security_bprm_committing_creds() to apply any security requirements, such as flushing unauthorised files in SELinux, that must be done before the credentials are changed. This is made up of bits of security_bprm_apply_creds() and security_bprm_post_apply_creds(), both of which have been removed. This function is not allowed to fail; anything that might fail must have been done in (c.ii). (ii) Calls commit_creds() to apply the new credentials in a single assignment (more or less). Possibly pdeath_signal and dumpable should be part of struct creds. (iii) Unlocks the task's cred_replace_mutex, thus allowing PTRACE_ATTACH to take place. (iv) Clears The bprm->cred pointer as the credentials it was holding are now immutable. (v) Calls security_bprm_committed_creds() to apply any security alterations that must be done after the creds have been changed. SELinux uses this to flush signals and signal handlers. (f) If an error occurs before (d.i), bprm_free() will call abort_creds() to destroy the proposed new credentials and will then unlock cred_replace_mutex. No changes to the credentials will have been made. (2) LSM interface. A number of functions have been changed, added or removed: (*) security_bprm_alloc(), ->bprm_alloc_security() (*) security_bprm_free(), ->bprm_free_security() Removed in favour of preparing new credentials and modifying those. (*) security_bprm_apply_creds(), ->bprm_apply_creds() (*) security_bprm_post_apply_creds(), ->bprm_post_apply_creds() Removed; split between security_bprm_set_creds(), security_bprm_committing_creds() and security_bprm_committed_creds(). (*) security_bprm_set(), ->bprm_set_security() Removed; folded into security_bprm_set_creds(). (*) security_bprm_set_creds(), ->bprm_set_creds() New. The new credentials in bprm->creds should be checked and set up as appropriate. bprm->cred_prepared is 0 on the first call, 1 on the second and subsequent calls. (*) security_bprm_committing_creds(), ->bprm_committing_creds() (*) security_bprm_committed_creds(), ->bprm_committed_creds() New. Apply the security effects of the new credentials. This includes closing unauthorised files in SELinux. This function may not fail. When the former is called, the creds haven't yet been applied to the process; when the latter is called, they have. The former may access bprm->cred, the latter may not. (3) SELinux. SELinux has a number of changes, in addition to those to support the LSM interface changes mentioned above: (a) The bprm_security_struct struct has been removed in favour of using the credentials-under-construction approach. (c) flush_unauthorized_files() now takes a cred pointer and passes it on to inode_has_perm(), file_has_perm() and dentry_open(). Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: James Morris <jmorris@namei.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: James Morris <jmorris@namei.org>
2008-11-14CRED: Inaugurate COW credentialsDavid Howells
Inaugurate copy-on-write credentials management. This uses RCU to manage the credentials pointer in the task_struct with respect to accesses by other tasks. A process may only modify its own credentials, and so does not need locking to access or modify its own credentials. A mutex (cred_replace_mutex) is added to the task_struct to control the effect of PTRACE_ATTACHED on credential calculations, particularly with respect to execve(). With this patch, the contents of an active credentials struct may not be changed directly; rather a new set of credentials must be prepared, modified and committed using something like the following sequence of events: struct cred *new = prepare_creds(); int ret = blah(new); if (ret < 0) { abort_creds(new); return ret; } return commit_creds(new); There are some exceptions to this rule: the keyrings pointed to by the active credentials may be instantiated - keyrings violate the COW rule as managing COW keyrings is tricky, given that it is possible for a task to directly alter the keys in a keyring in use by another task. To help enforce this, various pointers to sets of credentials, such as those in the task_struct, are declared const. The purpose of this is compile-time discouragement of altering credentials through those pointers. Once a set of credentials has been made public through one of these pointers, it may not be modified, except under special circumstances: (1) Its reference count may incremented and decremented. (2) The keyrings to which it points may be modified, but not replaced. The only safe way to modify anything else is to create a replacement and commit using the functions described in Documentation/credentials.txt (which will be added by a later patch). This patch and the preceding patches have been tested with the LTP SELinux testsuite. This patch makes several logical sets of alteration: (1) execve(). This now prepares and commits credentials in various places in the security code rather than altering the current creds directly. (2) Temporary credential overrides. do_coredump() and sys_faccessat() now prepare their own credentials and temporarily override the ones currently on the acting thread, whilst preventing interference from other threads by holding cred_replace_mutex on the thread being dumped. This will be replaced in a future patch by something that hands down the credentials directly to the functions being called, rather than altering the task's objective credentials. (3) LSM interface. A number of functions have been changed, added or removed: (*) security_capset_check(), ->capset_check() (*) security_capset_set(), ->capset_set() Removed in favour of security_capset(). (*) security_capset(), ->capset() New. This is passed a pointer to the new creds, a pointer to the old creds and the proposed capability sets. It should fill in the new creds or return an error. All pointers, barring the pointer to the new creds, are now const. (*) security_bprm_apply_creds(), ->bprm_apply_creds() Changed; now returns a value, which will cause the process to be killed if it's an error. (*) security_task_alloc(), ->task_alloc_security() Removed in favour of security_prepare_creds(). (*) security_cred_free(), ->cred_free() New. Free security data attached to cred->security. (*) security_prepare_creds(), ->cred_prepare() New. Duplicate any security data attached to cred->security. (*) security_commit_creds(), ->cred_commit() New. Apply any security effects for the upcoming installation of new security by commit_creds(). (*) security_task_post_setuid(), ->task_post_setuid() Removed in favour of security_task_fix_setuid(). (*) security_task_fix_setuid(), ->task_fix_setuid() Fix up the proposed new credentials for setuid(). This is used by cap_set_fix_setuid() to implicitly adjust capabilities in line with setuid() changes. Changes are made to the new credentials, rather than the task itself as in security_task_post_setuid(). (*) security_task_reparent_to_init(), ->task_reparent_to_init() Removed. Instead the task being reparented to init is referred directly to init's credentials. NOTE! This results in the loss of some state: SELinux's osid no longer records the sid of the thread that forked it. (*) security_key_alloc(), ->key_alloc() (*) security_key_permission(), ->key_permission() Changed. These now take cred pointers rather than task pointers to refer to the security context. (4) sys_capset(). This has been simplified and uses less locking. The LSM functions it calls have been merged. (5) reparent_to_kthreadd(). This gives the current thread the same credentials as init by simply using commit_thread() to point that way. (6) __sigqueue_alloc() and switch_uid() __sigqueue_alloc() can't stop the target task from changing its creds beneath it, so this function gets a reference to the currently applicable user_struct which it then passes into the sigqueue struct it returns if successful. switch_uid() is now called from commit_creds(), and possibly should be folded into that. commit_creds() should take care of protecting __sigqueue_alloc(). (7) [sg]et[ug]id() and co and [sg]et_current_groups. The set functions now all use prepare_creds(), commit_creds() and abort_creds() to build and check a new set of credentials before applying it. security_task_set[ug]id() is called inside the prepared section. This guarantees that nothing else will affect the creds until we've finished. The calling of set_dumpable() has been moved into commit_creds(). Much of the functionality of set_user() has been moved into commit_creds(). The get functions all simply access the data directly. (8) security_task_prctl() and cap_task_prctl(). security_task_prctl() has been modified to return -ENOSYS if it doesn't want to handle a function, or otherwise return the return value directly rather than through an argument. Additionally, cap_task_prctl() now prepares a new set of credentials, even if it doesn't end up using it. (9) Keyrings. A number of changes have been made to the keyrings code: (a) switch_uid_keyring(), copy_keys(), exit_keys() and suid_keys() have all been dropped and built in to the credentials functions directly. They may want separating out again later. (b) key_alloc() and search_process_keyrings() now take a cred pointer rather than a task pointer to specify the security context. (c) copy_creds() gives a new thread within the same thread group a new thread keyring if its parent had one, otherwise it discards the thread keyring. (d) The authorisation key now points directly to the credentials to extend the search into rather pointing to the task that carries them. (e) Installing thread, process or session keyrings causes a new set of credentials to be created, even though it's not strictly necessary for process or session keyrings (they're shared). (10) Usermode helper. The usermode helper code now carries a cred struct pointer in its subprocess_info struct instead of a new session keyring pointer. This set of credentials is derived from init_cred and installed on the new process after it has been cloned. call_usermodehelper_setup() allocates the new credentials and call_usermodehelper_freeinfo() discards them if they haven't been used. A special cred function (prepare_usermodeinfo_creds()) is provided specifically for call_usermodehelper_setup() to call. call_usermodehelper_setkeys() adjusts the credentials to sport the supplied keyring as the new session keyring. (11) SELinux. SELinux has a number of changes, in addition to those to support the LSM interface changes mentioned above: (a) selinux_setprocattr() no longer does its check for whether the current ptracer can access processes with the new SID inside the lock that covers getting the ptracer's SID. Whilst this lock ensures that the check is done with the ptracer pinned, the result is only valid until the lock is released, so there's no point doing it inside the lock. (12) is_single_threaded(). This function has been extracted from selinux_setprocattr() and put into a file of its own in the lib/ directory as join_session_keyring() now wants to use it too. The code in SELinux just checked to see whether a task shared mm_structs with other tasks (CLONE_VM), but that isn't good enough. We really want to know if they're part of the same thread group (CLONE_THREAD). (13) nfsd. The NFS server daemon now has to use the COW credentials to set the credentials it is going to use. It really needs to pass the credentials down to the functions it calls, but it can't do that until other patches in this series have been applied. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: James Morris <jmorris@namei.org> Signed-off-by: James Morris <jmorris@namei.org>
2008-11-14CRED: Rename is_single_threaded() to is_wq_single_threaded()David Howells
Rename is_single_threaded() to is_wq_single_threaded() so that a new is_single_threaded() can be created that refers to tasks rather than waitqueues. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: James Morris <jmorris@namei.org> Signed-off-by: James Morris <jmorris@namei.org>
2008-11-14CRED: Separate per-task-group keyrings from signal_structDavid Howells
Separate per-task-group keyrings from signal_struct and dangle their anchor from the cred struct rather than the signal_struct. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: James Morris <jmorris@namei.org> Signed-off-by: James Morris <jmorris@namei.org>
2008-11-14CRED: Use RCU to access another task's creds and to release a task's own credsDavid Howells
Use RCU to access another task's creds and to release a task's own creds. This means that it will be possible for the credentials of a task to be replaced without another task (a) requiring a full lock to read them, and (b) seeing deallocated memory. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: James Morris <jmorris@namei.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: James Morris <jmorris@namei.org>
2008-11-14CRED: Wrap current->cred and a few other accessorsDavid Howells
Wrap current->cred and a few other accessors to hide their actual implementation. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: James Morris <jmorris@namei.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: James Morris <jmorris@namei.org>
2008-11-14CRED: Detach the credentials from task_structDavid Howells
Detach the credentials from task_struct, duplicating them in copy_process() and releasing them in __put_task_struct(). Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: James Morris <jmorris@namei.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: James Morris <jmorris@namei.org>
2008-11-14CRED: Separate task security context from task_structDavid Howells
Separate the task security context from task_struct. At this point, the security data is temporarily embedded in the task_struct with two pointers pointing to it. Note that the Alpha arch is altered as it refers to (E)UID and (E)GID in entry.S via asm-offsets. With comment fixes Signed-off-by: Marc Dionne <marc.c.dionne@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: James Morris <jmorris@namei.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: James Morris <jmorris@namei.org>
2008-11-14CRED: Neuter sys_capset()David Howells
Take away the ability for sys_capset() to affect processes other than current. This means that current will not need to lock its own credentials when reading them against interference by other processes. This has effectively been the case for a while anyway, since: (1) Without LSM enabled, sys_capset() is disallowed. (2) With file-based capabilities, sys_capset() is neutered. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Andrew G. Morgan <morgan@kernel.org> Acked-by: James Morris <jmorris@namei.org> Signed-off-by: James Morris <jmorris@namei.org>
2008-11-14KEYS: Alter use of key instantiation link-to-keyring argumentDavid Howells
Alter the use of the key instantiation and negation functions' link-to-keyring arguments. Currently this specifies a keyring in the target process to link the key into, creating the keyring if it doesn't exist. This, however, can be a problem for copy-on-write credentials as it means that the instantiating process can alter the credentials of the requesting process. This patch alters the behaviour such that: (1) If keyctl_instantiate_key() or keyctl_negate_key() are given a specific keyring by ID (ringid >= 0), then that keyring will be used. (2) If keyctl_instantiate_key() or keyctl_negate_key() are given one of the special constants that refer to the requesting process's keyrings (KEY_SPEC_*_KEYRING, all <= 0), then: (a) If sys_request_key() was given a keyring to use (destringid) then the key will be attached to that keyring. (b) If sys_request_key() was given a NULL keyring, then the key being instantiated will be attached to the default keyring as set by keyctl_set_reqkey_keyring(). (3) No extra link will be made. Decision point (1) follows current behaviour, and allows those instantiators who've searched for a specifically named keyring in the requestor's keyring so as to partition the keys by type to still have their named keyrings. Decision point (2) allows the requestor to make sure that the key or keys that get produced by request_key() go where they want, whilst allowing the instantiator to request that the key is retained. This is mainly useful for situations where the instantiator makes a secondary request, the key for which should be retained by the initial requestor: +-----------+ +--------------+ +--------------+ | | | | | | | Requestor |------->| Instantiator |------->| Instantiator | | | | | | | +-----------+ +--------------+ +--------------+ request_key() request_key() This might be useful, for example, in Kerberos, where the requestor requests a ticket, and then the ticket instantiator requests the TGT, which someone else then has to go and fetch. The TGT, however, should be retained in the keyrings of the requestor, not the first instantiator. To make this explict an extra special keyring constant is also added. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: James Morris <jmorris@namei.org> Signed-off-by: James Morris <jmorris@namei.org>
2008-11-14CRED: Wrap task credential accesses in the core kernelDavid Howells
Wrap access to task credentials so that they can be separated more easily from the task_struct during the introduction of COW creds. Change most current->(|e|s|fs)[ug]id to current_(|e|s|fs)[ug]id(). Change some task->e?[ug]id to task_e?[ug]id(). In some places it makes more sense to use RCU directly rather than a convenient wrapper; these will be addressed by later patches. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: James Morris <jmorris@namei.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-audit@redhat.com Cc: containers@lists.linux-foundation.org Cc: linux-mm@kvack.org Signed-off-by: James Morris <jmorris@namei.org>
2008-11-13ftrace: remove unnecessary if condition of __unregister_ftrace_functionwalimis
Because it has goto out before ftrace_list == &ftrace_list_end, that's to say, we never meet this condition. Signed-off-by: walimis <walimisdev@gmail.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-13tracing: fix mmiotrace resizing crashIngo Molnar
Pekka reported a crash when resizing the mmiotrace tracer (if only mmiotrace is enabled). This happens because in that case we do not allocate the max buffer, but we try to use it. Make ring_buffer_resize() idempotent against NULL buffers. Reported-by: Pekka Paalanen <pq@iki.fi> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-13genirq: __irq_set_trigger: change pr_warning to pr_debugMark Nelson
Commit 0c5d1eb77a8be917b638344a22afe1398236482b (genirq: record trigger type) caused powerpc platforms that had no set_type() function in their struct irq_chip to spew out warnings about "No set_type function for IRQ...". This warning isn't necessarily justified though because the generic powerpc platform code calls set_irq_type() (which in turn calls __irq_set_trigger) with information from the device tree to establish the interrupt mappings, regardless of whether the PIC can actually set a type. A platform's irq_chip might not have a set_type function for a variety of reasons, for example: the platform may have the type essentially hard-coded, or as in the case for Cell interrupts are just messages past around that have no real concept of type, or the platform could even have a virtual PIC as on the PS3. Signed-off-by: Mark Nelson <markn@au1.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-13ftrace: CPU buffer start annotation clean upsSteven Rostedt
Impact: better handling of CPU buffer start annotation Because of the confusion with the per CPU buffers wrapping where one CPU might be more active at the end of the trace than the other CPUs causing that one CPU to have a shorter history. Kernel developers were confused by the "missing" data of that one CPU at the beginning of the trace output. An annotation was added to the trace output to show that the buffer had started: # tracer: function # # TASK-PID CPU# TIMESTAMP FUNCTION # | | | | | ##### CPU 3 buffer started #### <idle>-0 [003] 158.192959: smp_apic_timer_interrupt [...] <idle>-0 [003] 161.556520: default_idle ##### CPU 1 buffer started #### <idle>-0 [001] 161.592494: hrtimer_force_reprogram [etc] But this annotation gets a bit messy when tracers do not fill the buffers. This patch does a couple of things: One) it adds a flag to trace_options to disable these annotations Two) it does not annotate if the tracer did not overflow its buffer. This makes the output much cleaner. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-13ftrace: rename iter_ctrl to trace_optionsSteven Rostedt
Impact: rename file /debug/tracing/iter_ctrl to /debug/tracing/trace_options The original ftrace had a file called "iter_ctrl" that would control the way the output was iterated. But this file grew into a catch all for different trace options. This patch renames the file from iter_ctrl to trace_options to reflect this change. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-13ftrace: show buffer size in kilobytesSteven Rostedt
Impact: change the units of buffer_size_kb to kilobytes This patch changes the units of the buffer_size_kb file to kilobytes. Reading and writing to the file uses kilobytes as units. To help users to know what units are used, the output of the file now looks like: # cat /debug/tracing/buffer_size_kb 1408 Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-13ftrace: rename trace_entries to buffer_size_kbSteven Rostedt
Impact: rename of debugfs file trace_entries to buffer_size_kb The original ftrace had fixed size entries, and the number of entries was shown and modified via the file called trace_entries. By converting to the unified trace buffer, we now allow for variable size entries which makes the meaning of trace_entries pointless. Since trace_size might be confused to the size of the trace, this patch names it "buffer_size_kb" (thanks to Arjan van de Ven for this idea). [ mingo@elte.hu: changed from buffer_size to buffer_size_kb ] ( Note, the units are still bytes - the next patch changes that, to keep the wide rename patch separate from the unit-change patch. ) Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-13Merge branches 'tracing/branch-tracer', 'tracing/fastboot', ↵Ingo Molnar
'tracing/function-return-tracer' and 'tracing/urgent' into tracing/core