diff options
author | 2023-02-21 18:24:12 -0800 | |
---|---|---|
committer | 2023-02-21 18:24:12 -0800 | |
commit | 5b7c4cabbb65f5c469464da6c5f614cbd7f730f2 (patch) | |
tree | cc5c2d0a898769fd59549594fedb3ee6f84e59a0 /mm/mlock.c | |
download | linux-5b7c4cabbb65f5c469464da6c5f614cbd7f730f2.tar.gz linux-5b7c4cabbb65f5c469464da6c5f614cbd7f730f2.zip |
Merge tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-nextgrafted
Pull networking updates from Jakub Kicinski:
"Core:
- Add dedicated kmem_cache for typical/small skb->head, avoid having
to access struct page at kfree time, and improve memory use.
- Introduce sysctl to set default RPS configuration for new netdevs.
- Define Netlink protocol specification format which can be used to
describe messages used by each family and auto-generate parsers.
Add tools for generating kernel data structures and uAPI headers.
- Expose all net/core sysctls inside netns.
- Remove 4s sleep in netpoll if carrier is instantly detected on
boot.
- Add configurable limit of MDB entries per port, and port-vlan.
- Continue populating drop reasons throughout the stack.
- Retire a handful of legacy Qdiscs and classifiers.
Protocols:
- Support IPv4 big TCP (TSO frames larger than 64kB).
- Add IP_LOCAL_PORT_RANGE socket option, to control local port range
on socket by socket basis.
- Track and report in procfs number of MPTCP sockets used.
- Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP path
manager.
- IPv6: don't check net.ipv6.route.max_size and rely on garbage
collection to free memory (similarly to IPv4).
- Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986).
- ICMP: add per-rate limit counters.
- Add support for user scanning requests in ieee802154.
- Remove static WEP support.
- Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate
reporting.
- WiFi 7 EHT channel puncturing support (client & AP).
BPF:
- Add a rbtree data structure following the "next-gen data structure"
precedent set by recently added linked list, that is, by using
kfunc + kptr instead of adding a new BPF map type.
- Expose XDP hints via kfuncs with initial support for RX hash and
timestamp metadata.
- Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to
better support decap on GRE tunnel devices not operating in collect
metadata.
- Improve x86 JIT's codegen for PROBE_MEM runtime error checks.
- Remove the need for trace_printk_lock for bpf_trace_printk and
bpf_trace_vprintk helpers.
- Extend libbpf's bpf_tracing.h support for tracing arguments of
kprobes/uprobes and syscall as a special case.
- Significantly reduce the search time for module symbols by
livepatch and BPF.
- Enable cpumasks to be used as kptrs, which is useful for tracing
programs tracking which tasks end up running on which CPUs in
different time intervals.
- Add support for BPF trampoline on s390x and riscv64.
- Add capability to export the XDP features supported by the NIC.
- Add __bpf_kfunc tag for marking kernel functions as kfuncs.
- Add cgroup.memory=nobpf kernel parameter option to disable BPF
memory accounting for container environments.
Netfilter:
- Remove the CLUSTERIP target. It has been marked as obsolete for
years, and we still have WARN splats wrt races of the out-of-band
/proc interface installed by this target.
- Add 'destroy' commands to nf_tables. They are identical to the
existing 'delete' commands, but do not return an error if the
referenced object (set, chain, rule...) did not exist.
Driver API:
- Improve cpumask_local_spread() locality to help NICs set the right
IRQ affinity on AMD platforms.
- Separate C22 and C45 MDIO bus transactions more clearly.
- Introduce new DCB table to control DSCP rewrite on egress.
- Support configuration of Physical Layer Collision Avoidance (PLCA)
Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of
shared medium Ethernet.
- Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing
preemption of low priority frames by high priority frames.
- Add support for controlling MACSec offload using netlink SET.
- Rework devlink instance refcounts to allow registration and
de-registration under the instance lock. Split the code into
multiple files, drop some of the unnecessarily granular locks and
factor out common parts of netlink operation handling.
- Add TX frame aggregation parameters (for USB drivers).
- Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning
messages with notifications for debug.
- Allow offloading of UDP NEW connections via act_ct.
- Add support for per action HW stats in TC.
- Support hardware miss to TC action (continue processing in SW from
a specific point in the action chain).
- Warn if old Wireless Extension user space interface is used with
modern cfg80211/mac80211 drivers. Do not support Wireless
Extensions for Wi-Fi 7 devices at all. Everyone should switch to
using nl80211 interface instead.
- Improve the CAN bit timing configuration. Use extack to return
error messages directly to user space, update the SJW handling,
including the definition of a new default value that will benefit
CAN-FD controllers, by increasing their oscillator tolerance.
New hardware / drivers:
- Ethernet:
- nVidia BlueField-3 support (control traffic driver)
- Ethernet support for imx93 SoCs
- Motorcomm yt8531 gigabit Ethernet PHY
- onsemi NCN26000 10BASE-T1S PHY (with support for PLCA)
- Microchip LAN8841 PHY (incl. cable diagnostics and PTP)
- Amlogic gxl MDIO mux
- WiFi:
- RealTek RTL8188EU (rtl8xxxu)
- Qualcomm Wi-Fi 7 devices (ath12k)
- CAN:
- Renesas R-Car V4H
Drivers:
- Bluetooth:
- Set Per Platform Antenna Gain (PPAG) for Intel controllers.
- Ethernet NICs:
- Intel (1G, igc):
- support TSN / Qbv / packet scheduling features of i226 model
- Intel (100G, ice):
- use GNSS subsystem instead of TTY
- multi-buffer XDP support
- extend support for GPIO pins to E823 devices
- nVidia/Mellanox:
- update the shared buffer configuration on PFC commands
- implement PTP adjphase function for HW offset control
- TC support for Geneve and GRE with VF tunnel offload
- more efficient crypto key management method
- multi-port eswitch support
- Netronome/Corigine:
- add DCB IEEE support
- support IPsec offloading for NFP3800
- Freescale/NXP (enetc):
- support XDP_REDIRECT for XDP non-linear buffers
- improve reconfig, avoid link flap and waiting for idle
- support MAC Merge layer
- Other NICs:
- sfc/ef100: add basic devlink support for ef100
- ionic: rx_push mode operation (writing descriptors via MMIO)
- bnxt: use the auxiliary bus abstraction for RDMA
- r8169: disable ASPM and reset bus in case of tx timeout
- cpsw: support QSGMII mode for J721e CPSW9G
- cpts: support pulse-per-second output
- ngbe: add an mdio bus driver
- usbnet: optimize usbnet_bh() by avoiding unnecessary queuing
- r8152: handle devices with FW with NCM support
- amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation
- virtio-net: support multi buffer XDP
- virtio/vsock: replace virtio_vsock_pkt with sk_buff
- tsnep: XDP support
- Ethernet high-speed switches:
- nVidia/Mellanox (mlxsw):
- add support for latency TLV (in FW control messages)
- Microchip (sparx5):
- separate explicit and implicit traffic forwarding rules, make
the implicit rules always active
- add support for egress DSCP rewrite
- IS0 VCAP support (Ingress Classification)
- IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS
etc.)
- ES2 VCAP support (Egress Access Control)
- support for Per-Stream Filtering and Policing (802.1Q,
8.6.5.1)
- Ethernet embedded switches:
- Marvell (mv88e6xxx):
- add MAB (port auth) offload support
- enable PTP receive for mv88e6390
- NXP (ocelot):
- support MAC Merge layer
- support for the the vsc7512 internal copper phys
- Microchip:
- lan9303: convert to PHYLINK
- lan966x: support TC flower filter statistics
- lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x
- lan937x: support Credit Based Shaper configuration
- ksz9477: support Energy Efficient Ethernet
- other:
- qca8k: convert to regmap read/write API, use bulk operations
- rswitch: Improve TX timestamp accuracy
- Intel WiFi (iwlwifi):
- EHT (Wi-Fi 7) rate reporting
- STEP equalizer support: transfer some STEP (connection to radio
on platforms with integrated wifi) related parameters from the
BIOS to the firmware.
- Qualcomm 802.11ax WiFi (ath11k):
- IPQ5018 support
- Fine Timing Measurement (FTM) responder role support
- channel 177 support
- MediaTek WiFi (mt76):
- per-PHY LED support
- mt7996: EHT (Wi-Fi 7) support
- Wireless Ethernet Dispatch (WED) reset support
- switch to using page pool allocator
- RealTek WiFi (rtw89):
- support new version of Bluetooth co-existance
- Mobile:
- rmnet: support TX aggregation"
* tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1872 commits)
page_pool: add a comment explaining the fragment counter usage
net: ethtool: fix __ethtool_dev_mm_supported() implementation
ethtool: pse-pd: Fix double word in comments
xsk: add linux/vmalloc.h to xsk.c
sefltests: netdevsim: wait for devlink instance after netns removal
selftest: fib_tests: Always cleanup before exit
net/mlx5e: Align IPsec ASO result memory to be as required by hardware
net/mlx5e: TC, Set CT miss to the specific ct action instance
net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG
net/mlx5: Refactor tc miss handling to a single function
net/mlx5: Kconfig: Make tc offload depend on tc skb extension
net/sched: flower: Support hardware miss to tc action
net/sched: flower: Move filter handle initialization earlier
net/sched: cls_api: Support hardware miss to tc action
net/sched: Rename user cookie and act cookie
sfc: fix builds without CONFIG_RTC_LIB
sfc: clean up some inconsistent indentings
net/mlx4_en: Introduce flexible array to silence overflow warning
net: lan966x: Fix possible deadlock inside PTP
net/ulp: Remove redundant ->clone() test in inet_clone_ulp().
...
Diffstat (limited to 'mm/mlock.c')
-rw-r--r-- | mm/mlock.c | 777 |
1 files changed, 777 insertions, 0 deletions
diff --git a/mm/mlock.c b/mm/mlock.c new file mode 100644 index 000000000..7032f6dd0 --- /dev/null +++ b/mm/mlock.c @@ -0,0 +1,777 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * linux/mm/mlock.c + * + * (C) Copyright 1995 Linus Torvalds + * (C) Copyright 2002 Christoph Hellwig + */ + +#include <linux/capability.h> +#include <linux/mman.h> +#include <linux/mm.h> +#include <linux/sched/user.h> +#include <linux/swap.h> +#include <linux/swapops.h> +#include <linux/pagemap.h> +#include <linux/pagevec.h> +#include <linux/pagewalk.h> +#include <linux/mempolicy.h> +#include <linux/syscalls.h> +#include <linux/sched.h> +#include <linux/export.h> +#include <linux/rmap.h> +#include <linux/mmzone.h> +#include <linux/hugetlb.h> +#include <linux/memcontrol.h> +#include <linux/mm_inline.h> +#include <linux/secretmem.h> + +#include "internal.h" + +struct mlock_pvec { + local_lock_t lock; + struct pagevec vec; +}; + +static DEFINE_PER_CPU(struct mlock_pvec, mlock_pvec) = { + .lock = INIT_LOCAL_LOCK(lock), +}; + +bool can_do_mlock(void) +{ + if (rlimit(RLIMIT_MEMLOCK) != 0) + return true; + if (capable(CAP_IPC_LOCK)) + return true; + return false; +} +EXPORT_SYMBOL(can_do_mlock); + +/* + * Mlocked pages are marked with PageMlocked() flag for efficient testing + * in vmscan and, possibly, the fault path; and to support semi-accurate + * statistics. + * + * An mlocked page [PageMlocked(page)] is unevictable. As such, it will + * be placed on the LRU "unevictable" list, rather than the [in]active lists. + * The unevictable list is an LRU sibling list to the [in]active lists. + * PageUnevictable is set to indicate the unevictable state. + */ + +static struct lruvec *__mlock_page(struct page *page, struct lruvec *lruvec) +{ + /* There is nothing more we can do while it's off LRU */ + if (!TestClearPageLRU(page)) + return lruvec; + + lruvec = folio_lruvec_relock_irq(page_folio(page), lruvec); + + if (unlikely(page_evictable(page))) { + /* + * This is a little surprising, but quite possible: + * PageMlocked must have got cleared already by another CPU. + * Could this page be on the Unevictable LRU? I'm not sure, + * but move it now if so. + */ + if (PageUnevictable(page)) { + del_page_from_lru_list(page, lruvec); + ClearPageUnevictable(page); + add_page_to_lru_list(page, lruvec); + __count_vm_events(UNEVICTABLE_PGRESCUED, + thp_nr_pages(page)); + } + goto out; + } + + if (PageUnevictable(page)) { + if (PageMlocked(page)) + page->mlock_count++; + goto out; + } + + del_page_from_lru_list(page, lruvec); + ClearPageActive(page); + SetPageUnevictable(page); + page->mlock_count = !!PageMlocked(page); + add_page_to_lru_list(page, lruvec); + __count_vm_events(UNEVICTABLE_PGCULLED, thp_nr_pages(page)); +out: + SetPageLRU(page); + return lruvec; +} + +static struct lruvec *__mlock_new_page(struct page *page, struct lruvec *lruvec) +{ + VM_BUG_ON_PAGE(PageLRU(page), page); + + lruvec = folio_lruvec_relock_irq(page_folio(page), lruvec); + + /* As above, this is a little surprising, but possible */ + if (unlikely(page_evictable(page))) + goto out; + + SetPageUnevictable(page); + page->mlock_count = !!PageMlocked(page); + __count_vm_events(UNEVICTABLE_PGCULLED, thp_nr_pages(page)); +out: + add_page_to_lru_list(page, lruvec); + SetPageLRU(page); + return lruvec; +} + +static struct lruvec *__munlock_page(struct page *page, struct lruvec *lruvec) +{ + int nr_pages = thp_nr_pages(page); + bool isolated = false; + + if (!TestClearPageLRU(page)) + goto munlock; + + isolated = true; + lruvec = folio_lruvec_relock_irq(page_folio(page), lruvec); + + if (PageUnevictable(page)) { + /* Then mlock_count is maintained, but might undercount */ + if (page->mlock_count) + page->mlock_count--; + if (page->mlock_count) + goto out; + } + /* else assume that was the last mlock: reclaim will fix it if not */ + +munlock: + if (TestClearPageMlocked(page)) { + __mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages); + if (isolated || !PageUnevictable(page)) + __count_vm_events(UNEVICTABLE_PGMUNLOCKED, nr_pages); + else + __count_vm_events(UNEVICTABLE_PGSTRANDED, nr_pages); + } + + /* page_evictable() has to be checked *after* clearing Mlocked */ + if (isolated && PageUnevictable(page) && page_evictable(page)) { + del_page_from_lru_list(page, lruvec); + ClearPageUnevictable(page); + add_page_to_lru_list(page, lruvec); + __count_vm_events(UNEVICTABLE_PGRESCUED, nr_pages); + } +out: + if (isolated) + SetPageLRU(page); + return lruvec; +} + +/* + * Flags held in the low bits of a struct page pointer on the mlock_pvec. + */ +#define LRU_PAGE 0x1 +#define NEW_PAGE 0x2 +static inline struct page *mlock_lru(struct page *page) +{ + return (struct page *)((unsigned long)page + LRU_PAGE); +} + +static inline struct page *mlock_new(struct page *page) +{ + return (struct page *)((unsigned long)page + NEW_PAGE); +} + +/* + * mlock_pagevec() is derived from pagevec_lru_move_fn(): + * perhaps that can make use of such page pointer flags in future, + * but for now just keep it for mlock. We could use three separate + * pagevecs instead, but one feels better (munlocking a full pagevec + * does not need to drain mlocking pagevecs first). + */ +static void mlock_pagevec(struct pagevec *pvec) +{ + struct lruvec *lruvec = NULL; + unsigned long mlock; + struct page *page; + int i; + + for (i = 0; i < pagevec_count(pvec); i++) { + page = pvec->pages[i]; + mlock = (unsigned long)page & (LRU_PAGE | NEW_PAGE); + page = (struct page *)((unsigned long)page - mlock); + pvec->pages[i] = page; + + if (mlock & LRU_PAGE) + lruvec = __mlock_page(page, lruvec); + else if (mlock & NEW_PAGE) + lruvec = __mlock_new_page(page, lruvec); + else + lruvec = __munlock_page(page, lruvec); + } + + if (lruvec) + unlock_page_lruvec_irq(lruvec); + release_pages(pvec->pages, pvec->nr); + pagevec_reinit(pvec); +} + +void mlock_page_drain_local(void) +{ + struct pagevec *pvec; + + local_lock(&mlock_pvec.lock); + pvec = this_cpu_ptr(&mlock_pvec.vec); + if (pagevec_count(pvec)) + mlock_pagevec(pvec); + local_unlock(&mlock_pvec.lock); +} + +void mlock_page_drain_remote(int cpu) +{ + struct pagevec *pvec; + + WARN_ON_ONCE(cpu_online(cpu)); + pvec = &per_cpu(mlock_pvec.vec, cpu); + if (pagevec_count(pvec)) + mlock_pagevec(pvec); +} + +bool need_mlock_page_drain(int cpu) +{ + return pagevec_count(&per_cpu(mlock_pvec.vec, cpu)); +} + +/** + * mlock_folio - mlock a folio already on (or temporarily off) LRU + * @folio: folio to be mlocked. + */ +void mlock_folio(struct folio *folio) +{ + struct pagevec *pvec; + + local_lock(&mlock_pvec.lock); + pvec = this_cpu_ptr(&mlock_pvec.vec); + + if (!folio_test_set_mlocked(folio)) { + int nr_pages = folio_nr_pages(folio); + + zone_stat_mod_folio(folio, NR_MLOCK, nr_pages); + __count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages); + } + + folio_get(folio); + if (!pagevec_add(pvec, mlock_lru(&folio->page)) || + folio_test_large(folio) || lru_cache_disabled()) + mlock_pagevec(pvec); + local_unlock(&mlock_pvec.lock); +} + +/** + * mlock_new_page - mlock a newly allocated page not yet on LRU + * @page: page to be mlocked, either a normal page or a THP head. + */ +void mlock_new_page(struct page *page) +{ + struct pagevec *pvec; + int nr_pages = thp_nr_pages(page); + + local_lock(&mlock_pvec.lock); + pvec = this_cpu_ptr(&mlock_pvec.vec); + SetPageMlocked(page); + mod_zone_page_state(page_zone(page), NR_MLOCK, nr_pages); + __count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages); + + get_page(page); + if (!pagevec_add(pvec, mlock_new(page)) || + PageHead(page) || lru_cache_disabled()) + mlock_pagevec(pvec); + local_unlock(&mlock_pvec.lock); +} + +/** + * munlock_page - munlock a page + * @page: page to be munlocked, either a normal page or a THP head. + */ +void munlock_page(struct page *page) +{ + struct pagevec *pvec; + + local_lock(&mlock_pvec.lock); + pvec = this_cpu_ptr(&mlock_pvec.vec); + /* + * TestClearPageMlocked(page) must be left to __munlock_page(), + * which will check whether the page is multiply mlocked. + */ + + get_page(page); + if (!pagevec_add(pvec, page) || + PageHead(page) || lru_cache_disabled()) + mlock_pagevec(pvec); + local_unlock(&mlock_pvec.lock); +} + +static int mlock_pte_range(pmd_t *pmd, unsigned long addr, + unsigned long end, struct mm_walk *walk) + +{ + struct vm_area_struct *vma = walk->vma; + spinlock_t *ptl; + pte_t *start_pte, *pte; + struct page *page; + + ptl = pmd_trans_huge_lock(pmd, vma); + if (ptl) { + if (!pmd_present(*pmd)) + goto out; + if (is_huge_zero_pmd(*pmd)) + goto out; + page = pmd_page(*pmd); + if (vma->vm_flags & VM_LOCKED) + mlock_folio(page_folio(page)); + else + munlock_page(page); + goto out; + } + + start_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); + for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) { + if (!pte_present(*pte)) + continue; + page = vm_normal_page(vma, addr, *pte); + if (!page || is_zone_device_page(page)) + continue; + if (PageTransCompound(page)) + continue; + if (vma->vm_flags & VM_LOCKED) + mlock_folio(page_folio(page)); + else + munlock_page(page); + } + pte_unmap(start_pte); +out: + spin_unlock(ptl); + cond_resched(); + return 0; +} + +/* + * mlock_vma_pages_range() - mlock any pages already in the range, + * or munlock all pages in the range. + * @vma - vma containing range to be mlock()ed or munlock()ed + * @start - start address in @vma of the range + * @end - end of range in @vma + * @newflags - the new set of flags for @vma. + * + * Called for mlock(), mlock2() and mlockall(), to set @vma VM_LOCKED; + * called for munlock() and munlockall(), to clear VM_LOCKED from @vma. + */ +static void mlock_vma_pages_range(struct vm_area_struct *vma, + unsigned long start, unsigned long end, vm_flags_t newflags) +{ + static const struct mm_walk_ops mlock_walk_ops = { + .pmd_entry = mlock_pte_range, + }; + + /* + * There is a slight chance that concurrent page migration, + * or page reclaim finding a page of this now-VM_LOCKED vma, + * will call mlock_vma_page() and raise page's mlock_count: + * double counting, leaving the page unevictable indefinitely. + * Communicate this danger to mlock_vma_page() with VM_IO, + * which is a VM_SPECIAL flag not allowed on VM_LOCKED vmas. + * mmap_lock is held in write mode here, so this weird + * combination should not be visible to other mmap_lock users; + * but WRITE_ONCE so rmap walkers must see VM_IO if VM_LOCKED. + */ + if (newflags & VM_LOCKED) + newflags |= VM_IO; + WRITE_ONCE(vma->vm_flags, newflags); + + lru_add_drain(); + walk_page_range(vma->vm_mm, start, end, &mlock_walk_ops, NULL); + lru_add_drain(); + + if (newflags & VM_IO) { + newflags &= ~VM_IO; + WRITE_ONCE(vma->vm_flags, newflags); + } +} + +/* + * mlock_fixup - handle mlock[all]/munlock[all] requests. + * + * Filters out "special" vmas -- VM_LOCKED never gets set for these, and + * munlock is a no-op. However, for some special vmas, we go ahead and + * populate the ptes. + * + * For vmas that pass the filters, merge/split as appropriate. + */ +static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev, + unsigned long start, unsigned long end, vm_flags_t newflags) +{ + struct mm_struct *mm = vma->vm_mm; + pgoff_t pgoff; + int nr_pages; + int ret = 0; + vm_flags_t oldflags = vma->vm_flags; + + if (newflags == oldflags || (oldflags & VM_SPECIAL) || + is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm) || + vma_is_dax(vma) || vma_is_secretmem(vma)) + /* don't set VM_LOCKED or VM_LOCKONFAULT and don't count */ + goto out; + + pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT); + *prev = vma_merge(mm, *prev, start, end, newflags, vma->anon_vma, + vma->vm_file, pgoff, vma_policy(vma), + vma->vm_userfaultfd_ctx, anon_vma_name(vma)); + if (*prev) { + vma = *prev; + goto success; + } + + if (start != vma->vm_start) { + ret = split_vma(mm, vma, start, 1); + if (ret) + goto out; + } + + if (end != vma->vm_end) { + ret = split_vma(mm, vma, end, 0); + if (ret) + goto out; + } + +success: + /* + * Keep track of amount of locked VM. + */ + nr_pages = (end - start) >> PAGE_SHIFT; + if (!(newflags & VM_LOCKED)) + nr_pages = -nr_pages; + else if (oldflags & VM_LOCKED) + nr_pages = 0; + mm->locked_vm += nr_pages; + + /* + * vm_flags is protected by the mmap_lock held in write mode. + * It's okay if try_to_unmap_one unmaps a page just after we + * set VM_LOCKED, populate_vma_page_range will bring it back. + */ + + if ((newflags & VM_LOCKED) && (oldflags & VM_LOCKED)) { + /* No work to do, and mlocking twice would be wrong */ + vma->vm_flags = newflags; + } else { + mlock_vma_pages_range(vma, start, end, newflags); + } +out: + *prev = vma; + return ret; +} + +static int apply_vma_lock_flags(unsigned long start, size_t len, + vm_flags_t flags) +{ + unsigned long nstart, end, tmp; + struct vm_area_struct *vma, *prev; + int error; + MA_STATE(mas, ¤t->mm->mm_mt, start, start); + + VM_BUG_ON(offset_in_page(start)); + VM_BUG_ON(len != PAGE_ALIGN(len)); + end = start + len; + if (end < start) + return -EINVAL; + if (end == start) + return 0; + vma = mas_walk(&mas); + if (!vma) + return -ENOMEM; + + if (start > vma->vm_start) + prev = vma; + else + prev = mas_prev(&mas, 0); + + for (nstart = start ; ; ) { + vm_flags_t newflags = vma->vm_flags & VM_LOCKED_CLEAR_MASK; + + newflags |= flags; + + /* Here we know that vma->vm_start <= nstart < vma->vm_end. */ + tmp = vma->vm_end; + if (tmp > end) + tmp = end; + error = mlock_fixup(vma, &prev, nstart, tmp, newflags); + if (error) + break; + nstart = tmp; + if (nstart < prev->vm_end) + nstart = prev->vm_end; + if (nstart >= end) + break; + + vma = find_vma(prev->vm_mm, prev->vm_end); + if (!vma || vma->vm_start != nstart) { + error = -ENOMEM; + break; + } + } + return error; +} + +/* + * Go through vma areas and sum size of mlocked + * vma pages, as return value. + * Note deferred memory locking case(mlock2(,,MLOCK_ONFAULT) + * is also counted. + * Return value: previously mlocked page counts + */ +static unsigned long count_mm_mlocked_page_nr(struct mm_struct *mm, + unsigned long start, size_t len) +{ + struct vm_area_struct *vma; + unsigned long count = 0; + unsigned long end; + VMA_ITERATOR(vmi, mm, start); + + /* Don't overflow past ULONG_MAX */ + if (unlikely(ULONG_MAX - len < start)) + end = ULONG_MAX; + else + end = start + len; + + for_each_vma_range(vmi, vma, end) { + if (vma->vm_flags & VM_LOCKED) { + if (start > vma->vm_start) + count -= (start - vma->vm_start); + if (end < vma->vm_end) { + count += end - vma->vm_start; + break; + } + count += vma->vm_end - vma->vm_start; + } + } + + return count >> PAGE_SHIFT; +} + +/* + * convert get_user_pages() return value to posix mlock() error + */ +static int __mlock_posix_error_return(long retval) +{ + if (retval == -EFAULT) + retval = -ENOMEM; + else if (retval == -ENOMEM) + retval = -EAGAIN; + return retval; +} + +static __must_check int do_mlock(unsigned long start, size_t len, vm_flags_t flags) +{ + unsigned long locked; + unsigned long lock_limit; + int error = -ENOMEM; + + start = untagged_addr(start); + + if (!can_do_mlock()) + return -EPERM; + + len = PAGE_ALIGN(len + (offset_in_page(start))); + start &= PAGE_MASK; + + lock_limit = rlimit(RLIMIT_MEMLOCK); + lock_limit >>= PAGE_SHIFT; + locked = len >> PAGE_SHIFT; + + if (mmap_write_lock_killable(current->mm)) + return -EINTR; + + locked += current->mm->locked_vm; + if ((locked > lock_limit) && (!capable(CAP_IPC_LOCK))) { + /* + * It is possible that the regions requested intersect with + * previously mlocked areas, that part area in "mm->locked_vm" + * should not be counted to new mlock increment count. So check + * and adjust locked count if necessary. + */ + locked -= count_mm_mlocked_page_nr(current->mm, + start, len); + } + + /* check against resource limits */ + if ((locked <= lock_limit) || capable(CAP_IPC_LOCK)) + error = apply_vma_lock_flags(start, len, flags); + + mmap_write_unlock(current->mm); + if (error) + return error; + + error = __mm_populate(start, len, 0); + if (error) + return __mlock_posix_error_return(error); + return 0; +} + +SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len) +{ + return do_mlock(start, len, VM_LOCKED); +} + +SYSCALL_DEFINE3(mlock2, unsigned long, start, size_t, len, int, flags) +{ + vm_flags_t vm_flags = VM_LOCKED; + + if (flags & ~MLOCK_ONFAULT) + return -EINVAL; + + if (flags & MLOCK_ONFAULT) + vm_flags |= VM_LOCKONFAULT; + + return do_mlock(start, len, vm_flags); +} + +SYSCALL_DEFINE2(munlock, unsigned long, start, size_t, len) +{ + int ret; + + start = untagged_addr(start); + + len = PAGE_ALIGN(len + (offset_in_page(start))); + start &= PAGE_MASK; + + if (mmap_write_lock_killable(current->mm)) + return -EINTR; + ret = apply_vma_lock_flags(start, len, 0); + mmap_write_unlock(current->mm); + + return ret; +} + +/* + * Take the MCL_* flags passed into mlockall (or 0 if called from munlockall) + * and translate into the appropriate modifications to mm->def_flags and/or the + * flags for all current VMAs. + * + * There are a couple of subtleties with this. If mlockall() is called multiple + * times with different flags, the values do not necessarily stack. If mlockall + * is called once including the MCL_FUTURE flag and then a second time without + * it, VM_LOCKED and VM_LOCKONFAULT will be cleared from mm->def_flags. + */ +static int apply_mlockall_flags(int flags) +{ + MA_STATE(mas, ¤t->mm->mm_mt, 0, 0); + struct vm_area_struct *vma, *prev = NULL; + vm_flags_t to_add = 0; + + current->mm->def_flags &= VM_LOCKED_CLEAR_MASK; + if (flags & MCL_FUTURE) { + current->mm->def_flags |= VM_LOCKED; + + if (flags & MCL_ONFAULT) + current->mm->def_flags |= VM_LOCKONFAULT; + + if (!(flags & MCL_CURRENT)) + goto out; + } + + if (flags & MCL_CURRENT) { + to_add |= VM_LOCKED; + if (flags & MCL_ONFAULT) + to_add |= VM_LOCKONFAULT; + } + + mas_for_each(&mas, vma, ULONG_MAX) { + vm_flags_t newflags; + + newflags = vma->vm_flags & VM_LOCKED_CLEAR_MASK; + newflags |= to_add; + + /* Ignore errors */ + mlock_fixup(vma, &prev, vma->vm_start, vma->vm_end, newflags); + mas_pause(&mas); + cond_resched(); + } +out: + return 0; +} + +SYSCALL_DEFINE1(mlockall, int, flags) +{ + unsigned long lock_limit; + int ret; + + if (!flags || (flags & ~(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT)) || + flags == MCL_ONFAULT) + return -EINVAL; + + if (!can_do_mlock()) + return -EPERM; + + lock_limit = rlimit(RLIMIT_MEMLOCK); + lock_limit >>= PAGE_SHIFT; + + if (mmap_write_lock_killable(current->mm)) + return -EINTR; + + ret = -ENOMEM; + if (!(flags & MCL_CURRENT) || (current->mm->total_vm <= lock_limit) || + capable(CAP_IPC_LOCK)) + ret = apply_mlockall_flags(flags); + mmap_write_unlock(current->mm); + if (!ret && (flags & MCL_CURRENT)) + mm_populate(0, TASK_SIZE); + + return ret; +} + +SYSCALL_DEFINE0(munlockall) +{ + int ret; + + if (mmap_write_lock_killable(current->mm)) + return -EINTR; + ret = apply_mlockall_flags(0); + mmap_write_unlock(current->mm); + return ret; +} + +/* + * Objects with different lifetime than processes (SHM_LOCK and SHM_HUGETLB + * shm segments) get accounted against the user_struct instead. + */ +static DEFINE_SPINLOCK(shmlock_user_lock); + +int user_shm_lock(size_t size, struct ucounts *ucounts) +{ + unsigned long lock_limit, locked; + long memlock; + int allowed = 0; + + locked = (size + PAGE_SIZE - 1) >> PAGE_SHIFT; + lock_limit = rlimit(RLIMIT_MEMLOCK); + if (lock_limit != RLIM_INFINITY) + lock_limit >>= PAGE_SHIFT; + spin_lock(&shmlock_user_lock); + memlock = inc_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_MEMLOCK, locked); + + if ((memlock == LONG_MAX || memlock > lock_limit) && !capable(CAP_IPC_LOCK)) { + dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_MEMLOCK, locked); + goto out; + } + if (!get_ucounts(ucounts)) { + dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_MEMLOCK, locked); + allowed = 0; + goto out; + } + allowed = 1; +out: + spin_unlock(&shmlock_user_lock); + return allowed; +} + +void user_shm_unlock(size_t size, struct ucounts *ucounts) +{ + spin_lock(&shmlock_user_lock); + dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_MEMLOCK, (size + PAGE_SIZE - 1) >> PAGE_SHIFT); + spin_unlock(&shmlock_user_lock); + put_ucounts(ucounts); +} |