diff options
author | 2023-02-21 18:24:12 -0800 | |
---|---|---|
committer | 2023-02-21 18:24:12 -0800 | |
commit | 5b7c4cabbb65f5c469464da6c5f614cbd7f730f2 (patch) | |
tree | cc5c2d0a898769fd59549594fedb3ee6f84e59a0 /kernel/acct.c | |
download | linux-5b7c4cabbb65f5c469464da6c5f614cbd7f730f2.tar.gz linux-5b7c4cabbb65f5c469464da6c5f614cbd7f730f2.zip |
Merge tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-nextgrafted
Pull networking updates from Jakub Kicinski:
"Core:
- Add dedicated kmem_cache for typical/small skb->head, avoid having
to access struct page at kfree time, and improve memory use.
- Introduce sysctl to set default RPS configuration for new netdevs.
- Define Netlink protocol specification format which can be used to
describe messages used by each family and auto-generate parsers.
Add tools for generating kernel data structures and uAPI headers.
- Expose all net/core sysctls inside netns.
- Remove 4s sleep in netpoll if carrier is instantly detected on
boot.
- Add configurable limit of MDB entries per port, and port-vlan.
- Continue populating drop reasons throughout the stack.
- Retire a handful of legacy Qdiscs and classifiers.
Protocols:
- Support IPv4 big TCP (TSO frames larger than 64kB).
- Add IP_LOCAL_PORT_RANGE socket option, to control local port range
on socket by socket basis.
- Track and report in procfs number of MPTCP sockets used.
- Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP path
manager.
- IPv6: don't check net.ipv6.route.max_size and rely on garbage
collection to free memory (similarly to IPv4).
- Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986).
- ICMP: add per-rate limit counters.
- Add support for user scanning requests in ieee802154.
- Remove static WEP support.
- Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate
reporting.
- WiFi 7 EHT channel puncturing support (client & AP).
BPF:
- Add a rbtree data structure following the "next-gen data structure"
precedent set by recently added linked list, that is, by using
kfunc + kptr instead of adding a new BPF map type.
- Expose XDP hints via kfuncs with initial support for RX hash and
timestamp metadata.
- Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to
better support decap on GRE tunnel devices not operating in collect
metadata.
- Improve x86 JIT's codegen for PROBE_MEM runtime error checks.
- Remove the need for trace_printk_lock for bpf_trace_printk and
bpf_trace_vprintk helpers.
- Extend libbpf's bpf_tracing.h support for tracing arguments of
kprobes/uprobes and syscall as a special case.
- Significantly reduce the search time for module symbols by
livepatch and BPF.
- Enable cpumasks to be used as kptrs, which is useful for tracing
programs tracking which tasks end up running on which CPUs in
different time intervals.
- Add support for BPF trampoline on s390x and riscv64.
- Add capability to export the XDP features supported by the NIC.
- Add __bpf_kfunc tag for marking kernel functions as kfuncs.
- Add cgroup.memory=nobpf kernel parameter option to disable BPF
memory accounting for container environments.
Netfilter:
- Remove the CLUSTERIP target. It has been marked as obsolete for
years, and we still have WARN splats wrt races of the out-of-band
/proc interface installed by this target.
- Add 'destroy' commands to nf_tables. They are identical to the
existing 'delete' commands, but do not return an error if the
referenced object (set, chain, rule...) did not exist.
Driver API:
- Improve cpumask_local_spread() locality to help NICs set the right
IRQ affinity on AMD platforms.
- Separate C22 and C45 MDIO bus transactions more clearly.
- Introduce new DCB table to control DSCP rewrite on egress.
- Support configuration of Physical Layer Collision Avoidance (PLCA)
Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of
shared medium Ethernet.
- Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing
preemption of low priority frames by high priority frames.
- Add support for controlling MACSec offload using netlink SET.
- Rework devlink instance refcounts to allow registration and
de-registration under the instance lock. Split the code into
multiple files, drop some of the unnecessarily granular locks and
factor out common parts of netlink operation handling.
- Add TX frame aggregation parameters (for USB drivers).
- Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning
messages with notifications for debug.
- Allow offloading of UDP NEW connections via act_ct.
- Add support for per action HW stats in TC.
- Support hardware miss to TC action (continue processing in SW from
a specific point in the action chain).
- Warn if old Wireless Extension user space interface is used with
modern cfg80211/mac80211 drivers. Do not support Wireless
Extensions for Wi-Fi 7 devices at all. Everyone should switch to
using nl80211 interface instead.
- Improve the CAN bit timing configuration. Use extack to return
error messages directly to user space, update the SJW handling,
including the definition of a new default value that will benefit
CAN-FD controllers, by increasing their oscillator tolerance.
New hardware / drivers:
- Ethernet:
- nVidia BlueField-3 support (control traffic driver)
- Ethernet support for imx93 SoCs
- Motorcomm yt8531 gigabit Ethernet PHY
- onsemi NCN26000 10BASE-T1S PHY (with support for PLCA)
- Microchip LAN8841 PHY (incl. cable diagnostics and PTP)
- Amlogic gxl MDIO mux
- WiFi:
- RealTek RTL8188EU (rtl8xxxu)
- Qualcomm Wi-Fi 7 devices (ath12k)
- CAN:
- Renesas R-Car V4H
Drivers:
- Bluetooth:
- Set Per Platform Antenna Gain (PPAG) for Intel controllers.
- Ethernet NICs:
- Intel (1G, igc):
- support TSN / Qbv / packet scheduling features of i226 model
- Intel (100G, ice):
- use GNSS subsystem instead of TTY
- multi-buffer XDP support
- extend support for GPIO pins to E823 devices
- nVidia/Mellanox:
- update the shared buffer configuration on PFC commands
- implement PTP adjphase function for HW offset control
- TC support for Geneve and GRE with VF tunnel offload
- more efficient crypto key management method
- multi-port eswitch support
- Netronome/Corigine:
- add DCB IEEE support
- support IPsec offloading for NFP3800
- Freescale/NXP (enetc):
- support XDP_REDIRECT for XDP non-linear buffers
- improve reconfig, avoid link flap and waiting for idle
- support MAC Merge layer
- Other NICs:
- sfc/ef100: add basic devlink support for ef100
- ionic: rx_push mode operation (writing descriptors via MMIO)
- bnxt: use the auxiliary bus abstraction for RDMA
- r8169: disable ASPM and reset bus in case of tx timeout
- cpsw: support QSGMII mode for J721e CPSW9G
- cpts: support pulse-per-second output
- ngbe: add an mdio bus driver
- usbnet: optimize usbnet_bh() by avoiding unnecessary queuing
- r8152: handle devices with FW with NCM support
- amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation
- virtio-net: support multi buffer XDP
- virtio/vsock: replace virtio_vsock_pkt with sk_buff
- tsnep: XDP support
- Ethernet high-speed switches:
- nVidia/Mellanox (mlxsw):
- add support for latency TLV (in FW control messages)
- Microchip (sparx5):
- separate explicit and implicit traffic forwarding rules, make
the implicit rules always active
- add support for egress DSCP rewrite
- IS0 VCAP support (Ingress Classification)
- IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS
etc.)
- ES2 VCAP support (Egress Access Control)
- support for Per-Stream Filtering and Policing (802.1Q,
8.6.5.1)
- Ethernet embedded switches:
- Marvell (mv88e6xxx):
- add MAB (port auth) offload support
- enable PTP receive for mv88e6390
- NXP (ocelot):
- support MAC Merge layer
- support for the the vsc7512 internal copper phys
- Microchip:
- lan9303: convert to PHYLINK
- lan966x: support TC flower filter statistics
- lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x
- lan937x: support Credit Based Shaper configuration
- ksz9477: support Energy Efficient Ethernet
- other:
- qca8k: convert to regmap read/write API, use bulk operations
- rswitch: Improve TX timestamp accuracy
- Intel WiFi (iwlwifi):
- EHT (Wi-Fi 7) rate reporting
- STEP equalizer support: transfer some STEP (connection to radio
on platforms with integrated wifi) related parameters from the
BIOS to the firmware.
- Qualcomm 802.11ax WiFi (ath11k):
- IPQ5018 support
- Fine Timing Measurement (FTM) responder role support
- channel 177 support
- MediaTek WiFi (mt76):
- per-PHY LED support
- mt7996: EHT (Wi-Fi 7) support
- Wireless Ethernet Dispatch (WED) reset support
- switch to using page pool allocator
- RealTek WiFi (rtw89):
- support new version of Bluetooth co-existance
- Mobile:
- rmnet: support TX aggregation"
* tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1872 commits)
page_pool: add a comment explaining the fragment counter usage
net: ethtool: fix __ethtool_dev_mm_supported() implementation
ethtool: pse-pd: Fix double word in comments
xsk: add linux/vmalloc.h to xsk.c
sefltests: netdevsim: wait for devlink instance after netns removal
selftest: fib_tests: Always cleanup before exit
net/mlx5e: Align IPsec ASO result memory to be as required by hardware
net/mlx5e: TC, Set CT miss to the specific ct action instance
net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG
net/mlx5: Refactor tc miss handling to a single function
net/mlx5: Kconfig: Make tc offload depend on tc skb extension
net/sched: flower: Support hardware miss to tc action
net/sched: flower: Move filter handle initialization earlier
net/sched: cls_api: Support hardware miss to tc action
net/sched: Rename user cookie and act cookie
sfc: fix builds without CONFIG_RTC_LIB
sfc: clean up some inconsistent indentings
net/mlx4_en: Introduce flexible array to silence overflow warning
net: lan966x: Fix possible deadlock inside PTP
net/ulp: Remove redundant ->clone() test in inet_clone_ulp().
...
Diffstat (limited to 'kernel/acct.c')
-rw-r--r-- | kernel/acct.c | 623 |
1 files changed, 623 insertions, 0 deletions
diff --git a/kernel/acct.c b/kernel/acct.c new file mode 100644 index 000000000..010667ce6 --- /dev/null +++ b/kernel/acct.c @@ -0,0 +1,623 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * linux/kernel/acct.c + * + * BSD Process Accounting for Linux + * + * Author: Marco van Wieringen <mvw@planets.elm.net> + * + * Some code based on ideas and code from: + * Thomas K. Dyas <tdyas@eden.rutgers.edu> + * + * This file implements BSD-style process accounting. Whenever any + * process exits, an accounting record of type "struct acct" is + * written to the file specified with the acct() system call. It is + * up to user-level programs to do useful things with the accounting + * log. The kernel just provides the raw accounting information. + * + * (C) Copyright 1995 - 1997 Marco van Wieringen - ELM Consultancy B.V. + * + * Plugged two leaks. 1) It didn't return acct_file into the free_filps if + * the file happened to be read-only. 2) If the accounting was suspended + * due to the lack of space it happily allowed to reopen it and completely + * lost the old acct_file. 3/10/98, Al Viro. + * + * Now we silently close acct_file on attempt to reopen. Cleaned sys_acct(). + * XTerms and EMACS are manifestations of pure evil. 21/10/98, AV. + * + * Fixed a nasty interaction with sys_umount(). If the accounting + * was suspeneded we failed to stop it on umount(). Messy. + * Another one: remount to readonly didn't stop accounting. + * Question: what should we do if we have CAP_SYS_ADMIN but not + * CAP_SYS_PACCT? Current code does the following: umount returns -EBUSY + * unless we are messing with the root. In that case we are getting a + * real mess with do_remount_sb(). 9/11/98, AV. + * + * Fixed a bunch of races (and pair of leaks). Probably not the best way, + * but this one obviously doesn't introduce deadlocks. Later. BTW, found + * one race (and leak) in BSD implementation. + * OK, that's better. ANOTHER race and leak in BSD variant. There always + * is one more bug... 10/11/98, AV. + * + * Oh, fsck... Oopsable SMP race in do_process_acct() - we must hold + * ->mmap_lock to walk the vma list of current->mm. Nasty, since it leaks + * a struct file opened for write. Fixed. 2/6/2000, AV. + */ + +#include <linux/mm.h> +#include <linux/slab.h> +#include <linux/acct.h> +#include <linux/capability.h> +#include <linux/file.h> +#include <linux/tty.h> +#include <linux/security.h> +#include <linux/vfs.h> +#include <linux/jiffies.h> +#include <linux/times.h> +#include <linux/syscalls.h> +#include <linux/mount.h> +#include <linux/uaccess.h> +#include <linux/sched/cputime.h> + +#include <asm/div64.h> +#include <linux/pid_namespace.h> +#include <linux/fs_pin.h> + +/* + * These constants control the amount of freespace that suspend and + * resume the process accounting system, and the time delay between + * each check. + * Turned into sysctl-controllable parameters. AV, 12/11/98 + */ + +static int acct_parm[3] = {4, 2, 30}; +#define RESUME (acct_parm[0]) /* >foo% free space - resume */ +#define SUSPEND (acct_parm[1]) /* <foo% free space - suspend */ +#define ACCT_TIMEOUT (acct_parm[2]) /* foo second timeout between checks */ + +#ifdef CONFIG_SYSCTL +static struct ctl_table kern_acct_table[] = { + { + .procname = "acct", + .data = &acct_parm, + .maxlen = 3*sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec, + }, + { } +}; + +static __init int kernel_acct_sysctls_init(void) +{ + register_sysctl_init("kernel", kern_acct_table); + return 0; +} +late_initcall(kernel_acct_sysctls_init); +#endif /* CONFIG_SYSCTL */ + +/* + * External references and all of the globals. + */ + +struct bsd_acct_struct { + struct fs_pin pin; + atomic_long_t count; + struct rcu_head rcu; + struct mutex lock; + int active; + unsigned long needcheck; + struct file *file; + struct pid_namespace *ns; + struct work_struct work; + struct completion done; +}; + +static void do_acct_process(struct bsd_acct_struct *acct); + +/* + * Check the amount of free space and suspend/resume accordingly. + */ +static int check_free_space(struct bsd_acct_struct *acct) +{ + struct kstatfs sbuf; + + if (time_is_after_jiffies(acct->needcheck)) + goto out; + + /* May block */ + if (vfs_statfs(&acct->file->f_path, &sbuf)) + goto out; + + if (acct->active) { + u64 suspend = sbuf.f_blocks * SUSPEND; + do_div(suspend, 100); + if (sbuf.f_bavail <= suspend) { + acct->active = 0; + pr_info("Process accounting paused\n"); + } + } else { + u64 resume = sbuf.f_blocks * RESUME; + do_div(resume, 100); + if (sbuf.f_bavail >= resume) { + acct->active = 1; + pr_info("Process accounting resumed\n"); + } + } + + acct->needcheck = jiffies + ACCT_TIMEOUT*HZ; +out: + return acct->active; +} + +static void acct_put(struct bsd_acct_struct *p) +{ + if (atomic_long_dec_and_test(&p->count)) + kfree_rcu(p, rcu); +} + +static inline struct bsd_acct_struct *to_acct(struct fs_pin *p) +{ + return p ? container_of(p, struct bsd_acct_struct, pin) : NULL; +} + +static struct bsd_acct_struct *acct_get(struct pid_namespace *ns) +{ + struct bsd_acct_struct *res; +again: + smp_rmb(); + rcu_read_lock(); + res = to_acct(READ_ONCE(ns->bacct)); + if (!res) { + rcu_read_unlock(); + return NULL; + } + if (!atomic_long_inc_not_zero(&res->count)) { + rcu_read_unlock(); + cpu_relax(); + goto again; + } + rcu_read_unlock(); + mutex_lock(&res->lock); + if (res != to_acct(READ_ONCE(ns->bacct))) { + mutex_unlock(&res->lock); + acct_put(res); + goto again; + } + return res; +} + +static void acct_pin_kill(struct fs_pin *pin) +{ + struct bsd_acct_struct *acct = to_acct(pin); + mutex_lock(&acct->lock); + do_acct_process(acct); + schedule_work(&acct->work); + wait_for_completion(&acct->done); + cmpxchg(&acct->ns->bacct, pin, NULL); + mutex_unlock(&acct->lock); + pin_remove(pin); + acct_put(acct); +} + +static void close_work(struct work_struct *work) +{ + struct bsd_acct_struct *acct = container_of(work, struct bsd_acct_struct, work); + struct file *file = acct->file; + if (file->f_op->flush) + file->f_op->flush(file, NULL); + __fput_sync(file); + complete(&acct->done); +} + +static int acct_on(struct filename *pathname) +{ + struct file *file; + struct vfsmount *mnt, *internal; + struct pid_namespace *ns = task_active_pid_ns(current); + struct bsd_acct_struct *acct; + struct fs_pin *old; + int err; + + acct = kzalloc(sizeof(struct bsd_acct_struct), GFP_KERNEL); + if (!acct) + return -ENOMEM; + + /* Difference from BSD - they don't do O_APPEND */ + file = file_open_name(pathname, O_WRONLY|O_APPEND|O_LARGEFILE, 0); + if (IS_ERR(file)) { + kfree(acct); + return PTR_ERR(file); + } + + if (!S_ISREG(file_inode(file)->i_mode)) { + kfree(acct); + filp_close(file, NULL); + return -EACCES; + } + + if (!(file->f_mode & FMODE_CAN_WRITE)) { + kfree(acct); + filp_close(file, NULL); + return -EIO; + } + internal = mnt_clone_internal(&file->f_path); + if (IS_ERR(internal)) { + kfree(acct); + filp_close(file, NULL); + return PTR_ERR(internal); + } + err = __mnt_want_write(internal); + if (err) { + mntput(internal); + kfree(acct); + filp_close(file, NULL); + return err; + } + mnt = file->f_path.mnt; + file->f_path.mnt = internal; + + atomic_long_set(&acct->count, 1); + init_fs_pin(&acct->pin, acct_pin_kill); + acct->file = file; + acct->needcheck = jiffies; + acct->ns = ns; + mutex_init(&acct->lock); + INIT_WORK(&acct->work, close_work); + init_completion(&acct->done); + mutex_lock_nested(&acct->lock, 1); /* nobody has seen it yet */ + pin_insert(&acct->pin, mnt); + + rcu_read_lock(); + old = xchg(&ns->bacct, &acct->pin); + mutex_unlock(&acct->lock); + pin_kill(old); + __mnt_drop_write(mnt); + mntput(mnt); + return 0; +} + +static DEFINE_MUTEX(acct_on_mutex); + +/** + * sys_acct - enable/disable process accounting + * @name: file name for accounting records or NULL to shutdown accounting + * + * sys_acct() is the only system call needed to implement process + * accounting. It takes the name of the file where accounting records + * should be written. If the filename is NULL, accounting will be + * shutdown. + * + * Returns: 0 for success or negative errno values for failure. + */ +SYSCALL_DEFINE1(acct, const char __user *, name) +{ + int error = 0; + + if (!capable(CAP_SYS_PACCT)) + return -EPERM; + + if (name) { + struct filename *tmp = getname(name); + + if (IS_ERR(tmp)) + return PTR_ERR(tmp); + mutex_lock(&acct_on_mutex); + error = acct_on(tmp); + mutex_unlock(&acct_on_mutex); + putname(tmp); + } else { + rcu_read_lock(); + pin_kill(task_active_pid_ns(current)->bacct); + } + + return error; +} + +void acct_exit_ns(struct pid_namespace *ns) +{ + rcu_read_lock(); + pin_kill(ns->bacct); +} + +/* + * encode an u64 into a comp_t + * + * This routine has been adopted from the encode_comp_t() function in + * the kern_acct.c file of the FreeBSD operating system. The encoding + * is a 13-bit fraction with a 3-bit (base 8) exponent. + */ + +#define MANTSIZE 13 /* 13 bit mantissa. */ +#define EXPSIZE 3 /* Base 8 (3 bit) exponent. */ +#define MAXFRACT ((1 << MANTSIZE) - 1) /* Maximum fractional value. */ + +static comp_t encode_comp_t(u64 value) +{ + int exp, rnd; + + exp = rnd = 0; + while (value > MAXFRACT) { + rnd = value & (1 << (EXPSIZE - 1)); /* Round up? */ + value >>= EXPSIZE; /* Base 8 exponent == 3 bit shift. */ + exp++; + } + + /* + * If we need to round up, do it (and handle overflow correctly). + */ + if (rnd && (++value > MAXFRACT)) { + value >>= EXPSIZE; + exp++; + } + + if (exp > (((comp_t) ~0U) >> MANTSIZE)) + return (comp_t) ~0U; + /* + * Clean it up and polish it off. + */ + exp <<= MANTSIZE; /* Shift the exponent into place */ + exp += value; /* and add on the mantissa. */ + return exp; +} + +#if ACCT_VERSION == 1 || ACCT_VERSION == 2 +/* + * encode an u64 into a comp2_t (24 bits) + * + * Format: 5 bit base 2 exponent, 20 bits mantissa. + * The leading bit of the mantissa is not stored, but implied for + * non-zero exponents. + * Largest encodable value is 50 bits. + */ + +#define MANTSIZE2 20 /* 20 bit mantissa. */ +#define EXPSIZE2 5 /* 5 bit base 2 exponent. */ +#define MAXFRACT2 ((1ul << MANTSIZE2) - 1) /* Maximum fractional value. */ +#define MAXEXP2 ((1 << EXPSIZE2) - 1) /* Maximum exponent. */ + +static comp2_t encode_comp2_t(u64 value) +{ + int exp, rnd; + + exp = (value > (MAXFRACT2>>1)); + rnd = 0; + while (value > MAXFRACT2) { + rnd = value & 1; + value >>= 1; + exp++; + } + + /* + * If we need to round up, do it (and handle overflow correctly). + */ + if (rnd && (++value > MAXFRACT2)) { + value >>= 1; + exp++; + } + + if (exp > MAXEXP2) { + /* Overflow. Return largest representable number instead. */ + return (1ul << (MANTSIZE2+EXPSIZE2-1)) - 1; + } else { + return (value & (MAXFRACT2>>1)) | (exp << (MANTSIZE2-1)); + } +} +#elif ACCT_VERSION == 3 +/* + * encode an u64 into a 32 bit IEEE float + */ +static u32 encode_float(u64 value) +{ + unsigned exp = 190; + unsigned u; + + if (value == 0) + return 0; + while ((s64)value > 0) { + value <<= 1; + exp--; + } + u = (u32)(value >> 40) & 0x7fffffu; + return u | (exp << 23); +} +#endif + +/* + * Write an accounting entry for an exiting process + * + * The acct_process() call is the workhorse of the process + * accounting system. The struct acct is built here and then written + * into the accounting file. This function should only be called from + * do_exit() or when switching to a different output file. + */ + +static void fill_ac(acct_t *ac) +{ + struct pacct_struct *pacct = ¤t->signal->pacct; + u64 elapsed, run_time; + time64_t btime; + struct tty_struct *tty; + + /* + * Fill the accounting struct with the needed info as recorded + * by the different kernel functions. + */ + memset(ac, 0, sizeof(acct_t)); + + ac->ac_version = ACCT_VERSION | ACCT_BYTEORDER; + strlcpy(ac->ac_comm, current->comm, sizeof(ac->ac_comm)); + + /* calculate run_time in nsec*/ + run_time = ktime_get_ns(); + run_time -= current->group_leader->start_time; + /* convert nsec -> AHZ */ + elapsed = nsec_to_AHZ(run_time); +#if ACCT_VERSION == 3 + ac->ac_etime = encode_float(elapsed); +#else + ac->ac_etime = encode_comp_t(elapsed < (unsigned long) -1l ? + (unsigned long) elapsed : (unsigned long) -1l); +#endif +#if ACCT_VERSION == 1 || ACCT_VERSION == 2 + { + /* new enlarged etime field */ + comp2_t etime = encode_comp2_t(elapsed); + + ac->ac_etime_hi = etime >> 16; + ac->ac_etime_lo = (u16) etime; + } +#endif + do_div(elapsed, AHZ); + btime = ktime_get_real_seconds() - elapsed; + ac->ac_btime = clamp_t(time64_t, btime, 0, U32_MAX); +#if ACCT_VERSION==2 + ac->ac_ahz = AHZ; +#endif + + spin_lock_irq(¤t->sighand->siglock); + tty = current->signal->tty; /* Safe as we hold the siglock */ + ac->ac_tty = tty ? old_encode_dev(tty_devnum(tty)) : 0; + ac->ac_utime = encode_comp_t(nsec_to_AHZ(pacct->ac_utime)); + ac->ac_stime = encode_comp_t(nsec_to_AHZ(pacct->ac_stime)); + ac->ac_flag = pacct->ac_flag; + ac->ac_mem = encode_comp_t(pacct->ac_mem); + ac->ac_minflt = encode_comp_t(pacct->ac_minflt); + ac->ac_majflt = encode_comp_t(pacct->ac_majflt); + ac->ac_exitcode = pacct->ac_exitcode; + spin_unlock_irq(¤t->sighand->siglock); +} +/* + * do_acct_process does all actual work. Caller holds the reference to file. + */ +static void do_acct_process(struct bsd_acct_struct *acct) +{ + acct_t ac; + unsigned long flim; + const struct cred *orig_cred; + struct file *file = acct->file; + + /* + * Accounting records are not subject to resource limits. + */ + flim = rlimit(RLIMIT_FSIZE); + current->signal->rlim[RLIMIT_FSIZE].rlim_cur = RLIM_INFINITY; + /* Perform file operations on behalf of whoever enabled accounting */ + orig_cred = override_creds(file->f_cred); + + /* + * First check to see if there is enough free_space to continue + * the process accounting system. + */ + if (!check_free_space(acct)) + goto out; + + fill_ac(&ac); + /* we really need to bite the bullet and change layout */ + ac.ac_uid = from_kuid_munged(file->f_cred->user_ns, orig_cred->uid); + ac.ac_gid = from_kgid_munged(file->f_cred->user_ns, orig_cred->gid); +#if ACCT_VERSION == 1 || ACCT_VERSION == 2 + /* backward-compatible 16 bit fields */ + ac.ac_uid16 = ac.ac_uid; + ac.ac_gid16 = ac.ac_gid; +#elif ACCT_VERSION == 3 + { + struct pid_namespace *ns = acct->ns; + + ac.ac_pid = task_tgid_nr_ns(current, ns); + rcu_read_lock(); + ac.ac_ppid = task_tgid_nr_ns(rcu_dereference(current->real_parent), + ns); + rcu_read_unlock(); + } +#endif + /* + * Get freeze protection. If the fs is frozen, just skip the write + * as we could deadlock the system otherwise. + */ + if (file_start_write_trylock(file)) { + /* it's been opened O_APPEND, so position is irrelevant */ + loff_t pos = 0; + __kernel_write(file, &ac, sizeof(acct_t), &pos); + file_end_write(file); + } +out: + current->signal->rlim[RLIMIT_FSIZE].rlim_cur = flim; + revert_creds(orig_cred); +} + +/** + * acct_collect - collect accounting information into pacct_struct + * @exitcode: task exit code + * @group_dead: not 0, if this thread is the last one in the process. + */ +void acct_collect(long exitcode, int group_dead) +{ + struct pacct_struct *pacct = ¤t->signal->pacct; + u64 utime, stime; + unsigned long vsize = 0; + + if (group_dead && current->mm) { + struct mm_struct *mm = current->mm; + VMA_ITERATOR(vmi, mm, 0); + struct vm_area_struct *vma; + + mmap_read_lock(mm); + for_each_vma(vmi, vma) + vsize += vma->vm_end - vma->vm_start; + mmap_read_unlock(mm); + } + + spin_lock_irq(¤t->sighand->siglock); + if (group_dead) + pacct->ac_mem = vsize / 1024; + if (thread_group_leader(current)) { + pacct->ac_exitcode = exitcode; + if (current->flags & PF_FORKNOEXEC) + pacct->ac_flag |= AFORK; + } + if (current->flags & PF_SUPERPRIV) + pacct->ac_flag |= ASU; + if (current->flags & PF_DUMPCORE) + pacct->ac_flag |= ACORE; + if (current->flags & PF_SIGNALED) + pacct->ac_flag |= AXSIG; + + task_cputime(current, &utime, &stime); + pacct->ac_utime += utime; + pacct->ac_stime += stime; + pacct->ac_minflt += current->min_flt; + pacct->ac_majflt += current->maj_flt; + spin_unlock_irq(¤t->sighand->siglock); +} + +static void slow_acct_process(struct pid_namespace *ns) +{ + for ( ; ns; ns = ns->parent) { + struct bsd_acct_struct *acct = acct_get(ns); + if (acct) { + do_acct_process(acct); + mutex_unlock(&acct->lock); + acct_put(acct); + } + } +} + +/** + * acct_process - handles process accounting for an exiting task + */ +void acct_process(void) +{ + struct pid_namespace *ns; + + /* + * This loop is safe lockless, since current is still + * alive and holds its namespace, which in turn holds + * its parent. + */ + for (ns = task_active_pid_ns(current); ns != NULL; ns = ns->parent) { + if (ns->bacct) + break; + } + if (unlikely(ns)) + slow_acct_process(ns); +} |