From fdbcb2a6d6778e0b91938529694e5f40b4a66130 Mon Sep 17 00:00:00 2001
From: Waiman Long
Date: Mon, 28 Jun 2021 19:37:19 -0700
Subject: mm/memcg: move mod_objcg_state() to memcontrol.c

Patch series "mm/memcg: Reduce kmemcache memory accounting overhead", v6.

With the recent introduction of the new slab memory controller, we
eliminate the need for having separate kmemcaches for each memory cgroup
and reduce overall kernel memory usage.  However, we also add additional
memory accounting overhead to each call of kmem_cache_alloc() and
kmem_cache_free().  Workloads that require a lot of kmemcache allocations
and de-allocations may experience performance regressions, as illustrated
in [1] and [2].

A simple kernel module that, at module init time, performs a repeated loop
of 100,000,000 kmem_cache_alloc() and kmem_cache_free() calls on either a
small 32-byte object or a big 4k object, with a batch size of 4 (4
kmalloc's followed by 4 kfree's), is used for benchmarking.  The
benchmarking tool was run on a kernel based on linux-next-20210419.  The
test was run on a Cascade Lake server with turbo-boosting disabled to
reduce run-to-run variation.

The small object test exercises mainly the object stock charging and
vmstat update code paths.  The large object test also exercises the
refill_obj_stock() and __memcg_kmem_charge()/__memcg_kmem_uncharge() code
paths.

With memory accounting disabled, the run time was 3.130s for both the
small object and big object tests.  With memory accounting enabled, both
cgroup v1 and v2 showed similar results in the small object test.  The
performance results of the large object test, however, differed between
cgroup v1 and v2.  The execution times with the application of various
patches in the patchset were:

  Applied patches   Run time   Accounting overhead   %age 1   %age 2
  ---------------   --------   -------------------   ------   ------

  Small 32-byte object:
  None               11.634s         8.504s          100.0%   271.7%
  1-2                 9.425s         6.295s           74.0%   201.1%
  1-3                 9.708s         6.578s           77.4%   210.2%
  1-4                 8.062s         4.932s           58.0%   157.6%

  Large 4k object (v2):
  None               22.107s        18.977s          100.0%   606.3%
  1-2                20.960s        17.830s           94.0%   569.6%
  1-3                14.238s        11.108s           58.5%   354.9%
  1-4                11.329s         8.199s           43.2%   261.9%

  Large 4k object (v1):
  None               36.807s        33.677s          100.0%  1075.9%
  1-2                36.648s        33.518s           99.5%  1070.9%
  1-3                22.345s        19.215s           57.1%   613.9%
  1-4                18.662s        15.532s           46.1%   496.2%

  N.B. %age 1 = overhead/unpatched overhead
       %age 2 = overhead/accounting disabled time

Patch 2 (vmstat data stock caching) helps in both the small object test
and the large v2 object test.  It doesn't help much in the v1 big object
test.

Patch 3 (refill_obj_stock improvement) doesn't help the small object test
much, but offers a significant performance improvement for the large
object test (both v1 and v2).

Patch 4 (eliminating irq disable/enable) helps in all test cases.

To test for the extreme case, a multi-threaded kmalloc/kfree
microbenchmark was run on a 2-socket 48-core 96-thread system with 96
testing threads in the same memcg doing kmalloc+kfree of a 4k object with
accounting enabled for 10s.  The total number of kmalloc+kfree operations
done, in kilo operations per second (kops/s), was as follows:

  Applied patches   v1 kops/s   v1 change   v2 kops/s   v2 change
  ---------------   ---------   ---------   ---------   ---------
  None                  3,520       1.00X       6,242       1.00X
  1-2                   4,304       1.22X       8,478       1.36X
  1-3                   4,731       1.34X     418,142      66.99X
  1-4                   4,587       1.30X     438,838      70.30X

With memory accounting disabled, the kmalloc/kfree rate was 1,481,291
kops/s.  This test shows how significant the memory accounting overhead
can be in some extreme situations.
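For reference, below is a minimal sketch of the kind of single-threaded
benchmark module described above.  The actual module used for these
measurements is not part of this series; the module name, the use of
kmalloc()/kfree() with __GFP_ACCOUNT, the cond_resched() throttling and
the iteration constants are assumptions reconstructed from the
description (100,000,000 allocations in batches of 4).

  /* Sketch only -- not the author's benchmark module. */
  #include <linux/module.h>
  #include <linux/slab.h>
  #include <linux/sched.h>

  #define NR_BATCHES (100000000UL / 4)   /* 100M allocs in batches of 4 */
  #define OBJ_SIZE   32                  /* or 4096 for the large object test */

  static int __init kmem_bench_init(void)
  {
          void *obj[4];
          unsigned long i;
          int j;

          for (i = 0; i < NR_BATCHES; i++) {
                  for (j = 0; j < 4; j++)
                          obj[j] = kmalloc(OBJ_SIZE, GFP_KERNEL | __GFP_ACCOUNT);
                  for (j = 0; j < 4; j++)
                          kfree(obj[j]);
                  if (!(i % 1024))
                          cond_resched();        /* avoid soft lockups in init */
          }
          return 0;
  }

  static void __exit kmem_bench_exit(void)
  {
  }

  module_init(kmem_bench_init);
  module_exit(kmem_bench_exit);
  MODULE_LICENSE("GPL");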
For this multithreaded test, the improvement from patch 2 mainly comes from the conditional atomic xchg of objcg->nr_charged_bytes in mod_objcg_state(). By using an unconditional xchg, the operation rates were similar to the unpatched kernel. Patch 3 elminates the single highly contended cacheline of objcg->nr_charged_bytes for cgroup v2 leading to a huge performance improvement. Cgroup v1, however, still has another highly contended cacheline in the shared page counter &memcg->kmem. So the improvement is only modest. Patch 4 helps in cgroup v2, but performs worse in cgroup v1 as eliminating the irq_disable/irq_enable overhead seems to aggravate the cacheline contention. [1] https://lore.kernel.org/linux-mm/20210408193948.vfktg3azh2wrt56t@gabell/T/#u [2] https://lore.kernel.org/lkml/20210114025151.GA22932@xsang-OptiPlex-9020/ This patch (of 4): mod_objcg_state() is moved from mm/slab.h to mm/memcontrol.c so that further optimization can be done to it in later patches without exposing unnecessary details to other mm components. Link: https://lkml.kernel.org/r/20210506150007.16288-1-longman@redhat.com Link: https://lkml.kernel.org/r/20210506150007.16288-2-longman@redhat.com Signed-off-by: Waiman Long Acked-by: Johannes Weiner Reviewed-by: Shakeel Butt Acked-by: Roman Gushchin Cc: Alex Shi Cc: Chris Down Cc: Christoph Lameter Cc: David Rientjes Cc: Joonsoo Kim Cc: Masayoshi Mizuma Cc: Matthew Wilcox Cc: Michal Hocko Cc: Muchun Song Cc: Pekka Enberg Cc: Tejun Heo Cc: Vladimir Davydov Cc: Vlastimil Babka Cc: Wei Yang Cc: Xing Zhengjun Cc: Yafang Shao Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memcontrol.c | 13 +++++++++++++ 1 file changed, 13 insertions(+) (limited to 'mm/memcontrol.c') diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 64ada9e650a5..7cd7187a017c 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -782,6 +782,19 @@ void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val) rcu_read_unlock(); } +void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat, + enum node_stat_item idx, int nr) +{ + struct mem_cgroup *memcg; + struct lruvec *lruvec; + + rcu_read_lock(); + memcg = obj_cgroup_memcg(objcg); + lruvec = mem_cgroup_lruvec(memcg, pgdat); + mod_memcg_lruvec_state(lruvec, idx, nr); + rcu_read_unlock(); +} + /** * __count_memcg_events - account VM events in a cgroup * @memcg: the memory cgroup -- cgit 1.4.1 From 68ac5b3c8db2fda00af594eca4100aceaf927c0e Mon Sep 17 00:00:00 2001 From: Waiman Long Date: Mon, 28 Jun 2021 19:37:23 -0700 Subject: mm/memcg: cache vmstat data in percpu memcg_stock_pcp Before the new slab memory controller with per object byte charging, charging and vmstat data update happen only when new slab pages are allocated or freed. Now they are done with every kmem_cache_alloc() and kmem_cache_free(). This causes additional overhead for workloads that generate a lot of alloc and free calls. The memcg_stock_pcp is used to cache byte charge for a specific obj_cgroup to reduce that overhead. To further reducing it, this patch makes the vmstat data cached in the memcg_stock_pcp structure as well until it accumulates a page size worth of update or when other cached data change. Caching the vmstat data in the per-cpu stock eliminates two writes to non-hot cachelines for memcg specific as well as memcg-lruvecs specific vmstat data by a write to a hot local stock cacheline. 
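The caching scheme can be summarized by the simplified sketch below.  This
is an illustration of the idea only, not the kernel code (the real
implementation is in the mod_objcg_state()/drain_obj_stock() hunks that
follow); the stat_stock type and the flush callback are invented for the
example.  Deltas are accumulated against a cached key and only pushed out
to the shared vmstat counters when more than a page worth of data has
accumulated or the cached key changes.

  struct stat_stock {
          void *key;              /* cached (objcg, pgdat, idx) in the real code */
          int nr_bytes;           /* locally accumulated delta */
  };

  static void stock_add(struct stat_stock *s, void *key, int nr,
                        void (*flush)(void *key, int nr))
  {
          if (s->key != key) {                    /* key changed: flush old delta */
                  if (s->nr_bytes)
                          flush(s->key, s->nr_bytes);
                  s->key = key;
                  s->nr_bytes = 0;
          }
          s->nr_bytes += nr;
          if (abs(s->nr_bytes) > PAGE_SIZE) {     /* a page worth accumulated */
                  flush(s->key, s->nr_bytes);
                  s->nr_bytes = 0;
          }
  }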
On a 2-socket Cascade Lake server with instrumentation enabled and this patch applied, it was found that about 20% (634400 out of 3243830) of the time when mod_objcg_state() is called leads to an actual call to __mod_objcg_state() after initial boot. When doing parallel kernel build, the figure was about 17% (24329265 out of 142512465). So caching the vmstat data reduces the number of calls to __mod_objcg_state() by more than 80%. Link: https://lkml.kernel.org/r/20210506150007.16288-3-longman@redhat.com Signed-off-by: Waiman Long Reviewed-by: Shakeel Butt Cc: Alex Shi Cc: Chris Down Cc: Christoph Lameter Cc: David Rientjes Cc: Johannes Weiner Cc: Joonsoo Kim Cc: Masayoshi Mizuma Cc: Matthew Wilcox Cc: Michal Hocko Cc: Muchun Song Cc: Pekka Enberg Cc: Roman Gushchin Cc: Tejun Heo Cc: Vladimir Davydov Cc: Vlastimil Babka Cc: Wei Yang Cc: Xing Zhengjun Cc: Yafang Shao Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memcontrol.c | 90 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 87 insertions(+), 3 deletions(-) (limited to 'mm/memcontrol.c') diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 7cd7187a017c..b4624580d18a 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -782,8 +782,9 @@ void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val) rcu_read_unlock(); } -void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat, - enum node_stat_item idx, int nr) +static inline void mod_objcg_mlstate(struct obj_cgroup *objcg, + struct pglist_data *pgdat, + enum node_stat_item idx, int nr) { struct mem_cgroup *memcg; struct lruvec *lruvec; @@ -791,7 +792,7 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat, rcu_read_lock(); memcg = obj_cgroup_memcg(objcg); lruvec = mem_cgroup_lruvec(memcg, pgdat); - mod_memcg_lruvec_state(lruvec, idx, nr); + __mod_memcg_lruvec_state(lruvec, idx, nr); rcu_read_unlock(); } @@ -2059,7 +2060,10 @@ struct memcg_stock_pcp { #ifdef CONFIG_MEMCG_KMEM struct obj_cgroup *cached_objcg; + struct pglist_data *cached_pgdat; unsigned int nr_bytes; + int nr_slab_reclaimable_b; + int nr_slab_unreclaimable_b; #endif struct work_struct work; @@ -3008,6 +3012,67 @@ void __memcg_kmem_uncharge_page(struct page *page, int order) obj_cgroup_put(objcg); } +void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat, + enum node_stat_item idx, int nr) +{ + struct memcg_stock_pcp *stock; + unsigned long flags; + int *bytes; + + local_irq_save(flags); + stock = this_cpu_ptr(&memcg_stock); + + /* + * Save vmstat data in stock and skip vmstat array update unless + * accumulating over a page of vmstat data or when pgdat or idx + * changes. + */ + if (stock->cached_objcg != objcg) { + drain_obj_stock(stock); + obj_cgroup_get(objcg); + stock->nr_bytes = atomic_read(&objcg->nr_charged_bytes) + ? atomic_xchg(&objcg->nr_charged_bytes, 0) : 0; + stock->cached_objcg = objcg; + stock->cached_pgdat = pgdat; + } else if (stock->cached_pgdat != pgdat) { + /* Flush the existing cached vmstat data */ + if (stock->nr_slab_reclaimable_b) { + mod_objcg_mlstate(objcg, pgdat, NR_SLAB_RECLAIMABLE_B, + stock->nr_slab_reclaimable_b); + stock->nr_slab_reclaimable_b = 0; + } + if (stock->nr_slab_unreclaimable_b) { + mod_objcg_mlstate(objcg, pgdat, NR_SLAB_UNRECLAIMABLE_B, + stock->nr_slab_unreclaimable_b); + stock->nr_slab_unreclaimable_b = 0; + } + stock->cached_pgdat = pgdat; + } + + bytes = (idx == NR_SLAB_RECLAIMABLE_B) ? 
&stock->nr_slab_reclaimable_b + : &stock->nr_slab_unreclaimable_b; + /* + * Even for large object >= PAGE_SIZE, the vmstat data will still be + * cached locally at least once before pushing it out. + */ + if (!*bytes) { + *bytes = nr; + nr = 0; + } else { + *bytes += nr; + if (abs(*bytes) > PAGE_SIZE) { + nr = *bytes; + *bytes = 0; + } else { + nr = 0; + } + } + if (nr) + mod_objcg_mlstate(objcg, pgdat, idx, nr); + + local_irq_restore(flags); +} + static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes) { struct memcg_stock_pcp *stock; @@ -3055,6 +3120,25 @@ static void drain_obj_stock(struct memcg_stock_pcp *stock) stock->nr_bytes = 0; } + /* + * Flush the vmstat data in current stock + */ + if (stock->nr_slab_reclaimable_b || stock->nr_slab_unreclaimable_b) { + if (stock->nr_slab_reclaimable_b) { + mod_objcg_mlstate(old, stock->cached_pgdat, + NR_SLAB_RECLAIMABLE_B, + stock->nr_slab_reclaimable_b); + stock->nr_slab_reclaimable_b = 0; + } + if (stock->nr_slab_unreclaimable_b) { + mod_objcg_mlstate(old, stock->cached_pgdat, + NR_SLAB_UNRECLAIMABLE_B, + stock->nr_slab_unreclaimable_b); + stock->nr_slab_unreclaimable_b = 0; + } + stock->cached_pgdat = NULL; + } + obj_cgroup_put(old); stock->cached_objcg = NULL; } -- cgit 1.4.1 From 5387c90490f7f42df3209154ca955a453ee01b41 Mon Sep 17 00:00:00 2001 From: Waiman Long Date: Mon, 28 Jun 2021 19:37:27 -0700 Subject: mm/memcg: improve refill_obj_stock() performance There are two issues with the current refill_obj_stock() code. First of all, when nr_bytes reaches over PAGE_SIZE, it calls drain_obj_stock() to atomically flush out remaining bytes to obj_cgroup, clear cached_objcg and do a obj_cgroup_put(). It is likely that the same obj_cgroup will be used again which leads to another call to drain_obj_stock() and obj_cgroup_get() as well as atomically retrieve the available byte from obj_cgroup. That is costly. Instead, we should just uncharge the excess pages, reduce the stock bytes and be done with it. The drain_obj_stock() function should only be called when obj_cgroup changes. Secondly, when charging an object of size not less than a page in obj_cgroup_charge(), it is possible that the remaining bytes to be refilled to the stock will overflow a page and cause refill_obj_stock() to uncharge 1 page. To avoid the additional uncharge in this case, a new allow_uncharge flag is added to refill_obj_stock() which will be set to false when called from obj_cgroup_charge() so that an uncharge_pages() call won't be issued right after a charge_pages() call unless the objcg changes. A multithreaded kmalloc+kfree microbenchmark on a 2-socket 48-core 96-thread x86-64 system with 96 testing threads were run. Before this patch, the total number of kilo kmalloc+kfree operations done for a 4k large object by all the testing threads per second were 4,304 kops/s (cgroup v1) and 8,478 kops/s (cgroup v2). After applying this patch, the number were 4,731 (cgroup v1) and 418,142 (cgroup v2) respectively. This represents a performance improvement of 1.10X (cgroup v1) and 49.3X (cgroup v2). 
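As a concrete illustration of the charging arithmetic described above
(assuming PAGE_SIZE is 4096): charging a 5000-byte object in
obj_cgroup_charge() yields nr_pages = 1 and nr_bytes = 904, so two pages
are charged and the leftover PAGE_SIZE - 904 = 3192 bytes are refilled
into the stock with allow_uncharge set to false.  The stock may therefore
temporarily hold more than PAGE_SIZE pre-charged bytes without triggering
an immediate uncharge; the excess is only trimmed by a later
refill_obj_stock() call with allow_uncharge set to true, or flushed when
the cached objcg changes.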
Link: https://lkml.kernel.org/r/20210506150007.16288-4-longman@redhat.com Signed-off-by: Waiman Long Reviewed-by: Shakeel Butt Cc: Alex Shi Cc: Chris Down Cc: Christoph Lameter Cc: David Rientjes Cc: Johannes Weiner Cc: Joonsoo Kim Cc: Masayoshi Mizuma Cc: Matthew Wilcox Cc: Michal Hocko Cc: Muchun Song Cc: Pekka Enberg Cc: Roman Gushchin Cc: Tejun Heo Cc: Vladimir Davydov Cc: Vlastimil Babka Cc: Wei Yang Cc: Xing Zhengjun Cc: Yafang Shao Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memcontrol.c | 48 +++++++++++++++++++++++++++++++++++------------- 1 file changed, 35 insertions(+), 13 deletions(-) (limited to 'mm/memcontrol.c') diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b4624580d18a..17d38c7f630f 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3157,10 +3157,12 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock, return false; } -static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes) +static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, + bool allow_uncharge) { struct memcg_stock_pcp *stock; unsigned long flags; + unsigned int nr_pages = 0; local_irq_save(flags); @@ -3169,14 +3171,21 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes) drain_obj_stock(stock); obj_cgroup_get(objcg); stock->cached_objcg = objcg; - stock->nr_bytes = atomic_xchg(&objcg->nr_charged_bytes, 0); + stock->nr_bytes = atomic_read(&objcg->nr_charged_bytes) + ? atomic_xchg(&objcg->nr_charged_bytes, 0) : 0; + allow_uncharge = true; /* Allow uncharge when objcg changes */ } stock->nr_bytes += nr_bytes; - if (stock->nr_bytes > PAGE_SIZE) - drain_obj_stock(stock); + if (allow_uncharge && (stock->nr_bytes > PAGE_SIZE)) { + nr_pages = stock->nr_bytes >> PAGE_SHIFT; + stock->nr_bytes &= (PAGE_SIZE - 1); + } local_irq_restore(flags); + + if (nr_pages) + obj_cgroup_uncharge_pages(objcg, nr_pages); } int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size) @@ -3188,14 +3197,27 @@ int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size) return 0; /* - * In theory, memcg->nr_charged_bytes can have enough + * In theory, objcg->nr_charged_bytes can have enough * pre-charged bytes to satisfy the allocation. However, - * flushing memcg->nr_charged_bytes requires two atomic - * operations, and memcg->nr_charged_bytes can't be big, - * so it's better to ignore it and try grab some new pages. - * memcg->nr_charged_bytes will be flushed in - * refill_obj_stock(), called from this function or - * independently later. + * flushing objcg->nr_charged_bytes requires two atomic + * operations, and objcg->nr_charged_bytes can't be big. + * The shared objcg->nr_charged_bytes can also become a + * performance bottleneck if all tasks of the same memcg are + * trying to update it. So it's better to ignore it and try + * grab some new pages. The stock's nr_bytes will be flushed to + * objcg->nr_charged_bytes later on when objcg changes. + * + * The stock's nr_bytes may contain enough pre-charged bytes + * to allow one less page from being charged, but we can't rely + * on the pre-charged bytes not being changed outside of + * consume_obj_stock() or refill_obj_stock(). So ignore those + * pre-charged bytes as well when charging pages. To avoid a + * page uncharge right after a page charge, we set the + * allow_uncharge flag to false when calling refill_obj_stock() + * to temporarily allow the pre-charged bytes to exceed the page + * size limit. 
The maximum reachable value of the pre-charged + * bytes is (sizeof(object) + PAGE_SIZE - 2) if there is no data + * race. */ nr_pages = size >> PAGE_SHIFT; nr_bytes = size & (PAGE_SIZE - 1); @@ -3205,14 +3227,14 @@ int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size) ret = obj_cgroup_charge_pages(objcg, gfp, nr_pages); if (!ret && nr_bytes) - refill_obj_stock(objcg, PAGE_SIZE - nr_bytes); + refill_obj_stock(objcg, PAGE_SIZE - nr_bytes, false); return ret; } void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size) { - refill_obj_stock(objcg, size); + refill_obj_stock(objcg, size, true); } #endif /* CONFIG_MEMCG_KMEM */ -- cgit 1.4.1 From 559271146efc0bf125e6390191f683eab884e4a1 Mon Sep 17 00:00:00 2001 From: Waiman Long Date: Mon, 28 Jun 2021 19:37:30 -0700 Subject: mm/memcg: optimize user context object stock access Most kmem_cache_alloc() calls are from user context. With instrumentation enabled, the measured amount of kmem_cache_alloc() calls from non-task context was about 0.01% of the total. The irq disable/enable sequence used in this case to access content from object stock is slow. To optimize for user context access, there are now two sets of object stocks (in the new obj_stock structure) for task context and interrupt context access respectively. The task context object stock can be accessed after disabling preemption which is cheap in non-preempt kernel. The interrupt context object stock can only be accessed after disabling interrupt. User context code can access interrupt object stock, but not vice versa. The downside of this change is that there are more data stored in local object stocks and not reflected in the charge counter and the vmstat arrays. However, this is a small price to pay for better performance. [longman@redhat.com: fix potential uninitialized variable warning] Link: https://lkml.kernel.org/r/20210526193602.8742-1-longman@redhat.com [akpm@linux-foundation.org: coding style fixes] Link: https://lkml.kernel.org/r/20210506150007.16288-5-longman@redhat.com Signed-off-by: Waiman Long Acked-by: Roman Gushchin Reviewed-by: Shakeel Butt Cc: Johannes Weiner Cc: Michal Hocko Cc: Vladimir Davydov Cc: Tejun Heo Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Vlastimil Babka Cc: Roman Gushchin Cc: Muchun Song Cc: Alex Shi Cc: Chris Down Cc: Yafang Shao Cc: Wei Yang Cc: Masayoshi Mizuma Cc: Xing Zhengjun Cc: Matthew Wilcox Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memcontrol.c | 100 ++++++++++++++++++++++++++++++++++++++++---------------- 1 file changed, 72 insertions(+), 28 deletions(-) (limited to 'mm/memcontrol.c') diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 17d38c7f630f..97f76ce04eae 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -782,6 +782,10 @@ void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val) rcu_read_unlock(); } +/* + * mod_objcg_mlstate() may be called with irq enabled, so + * mod_memcg_lruvec_state() should be used. 
+ */ static inline void mod_objcg_mlstate(struct obj_cgroup *objcg, struct pglist_data *pgdat, enum node_stat_item idx, int nr) @@ -792,7 +796,7 @@ static inline void mod_objcg_mlstate(struct obj_cgroup *objcg, rcu_read_lock(); memcg = obj_cgroup_memcg(objcg); lruvec = mem_cgroup_lruvec(memcg, pgdat); - __mod_memcg_lruvec_state(lruvec, idx, nr); + mod_memcg_lruvec_state(lruvec, idx, nr); rcu_read_unlock(); } @@ -2054,17 +2058,23 @@ void unlock_page_memcg(struct page *page) } EXPORT_SYMBOL(unlock_page_memcg); -struct memcg_stock_pcp { - struct mem_cgroup *cached; /* this never be root cgroup */ - unsigned int nr_pages; - +struct obj_stock { #ifdef CONFIG_MEMCG_KMEM struct obj_cgroup *cached_objcg; struct pglist_data *cached_pgdat; unsigned int nr_bytes; int nr_slab_reclaimable_b; int nr_slab_unreclaimable_b; +#else + int dummy[0]; #endif +}; + +struct memcg_stock_pcp { + struct mem_cgroup *cached; /* this never be root cgroup */ + unsigned int nr_pages; + struct obj_stock task_obj; + struct obj_stock irq_obj; struct work_struct work; unsigned long flags; @@ -2074,12 +2084,12 @@ static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock); static DEFINE_MUTEX(percpu_charge_mutex); #ifdef CONFIG_MEMCG_KMEM -static void drain_obj_stock(struct memcg_stock_pcp *stock); +static void drain_obj_stock(struct obj_stock *stock); static bool obj_stock_flush_required(struct memcg_stock_pcp *stock, struct mem_cgroup *root_memcg); #else -static inline void drain_obj_stock(struct memcg_stock_pcp *stock) +static inline void drain_obj_stock(struct obj_stock *stock) { } static bool obj_stock_flush_required(struct memcg_stock_pcp *stock, @@ -2089,6 +2099,41 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock, } #endif +/* + * Most kmem_cache_alloc() calls are from user context. The irq disable/enable + * sequence used in this case to access content from object stock is slow. + * To optimize for user context access, there are now two object stocks for + * task context and interrupt context access respectively. + * + * The task context object stock can be accessed by disabling preemption only + * which is cheap in non-preempt kernel. The interrupt context object stock + * can only be accessed after disabling interrupt. User context code can + * access interrupt object stock, but not vice versa. + */ +static inline struct obj_stock *get_obj_stock(unsigned long *pflags) +{ + struct memcg_stock_pcp *stock; + + if (likely(in_task())) { + *pflags = 0UL; + preempt_disable(); + stock = this_cpu_ptr(&memcg_stock); + return &stock->task_obj; + } + + local_irq_save(*pflags); + stock = this_cpu_ptr(&memcg_stock); + return &stock->irq_obj; +} + +static inline void put_obj_stock(unsigned long flags) +{ + if (likely(in_task())) + preempt_enable(); + else + local_irq_restore(flags); +} + /** * consume_stock: Try to consume stocked charge on this cpu. * @memcg: memcg to consume from. 
@@ -2155,7 +2200,9 @@ static void drain_local_stock(struct work_struct *dummy) local_irq_save(flags); stock = this_cpu_ptr(&memcg_stock); - drain_obj_stock(stock); + drain_obj_stock(&stock->irq_obj); + if (in_task()) + drain_obj_stock(&stock->task_obj); drain_stock(stock); clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags); @@ -3015,13 +3062,10 @@ void __memcg_kmem_uncharge_page(struct page *page, int order) void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat, enum node_stat_item idx, int nr) { - struct memcg_stock_pcp *stock; unsigned long flags; + struct obj_stock *stock = get_obj_stock(&flags); int *bytes; - local_irq_save(flags); - stock = this_cpu_ptr(&memcg_stock); - /* * Save vmstat data in stock and skip vmstat array update unless * accumulating over a page of vmstat data or when pgdat or idx @@ -3070,29 +3114,26 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat, if (nr) mod_objcg_mlstate(objcg, pgdat, idx, nr); - local_irq_restore(flags); + put_obj_stock(flags); } static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes) { - struct memcg_stock_pcp *stock; unsigned long flags; + struct obj_stock *stock = get_obj_stock(&flags); bool ret = false; - local_irq_save(flags); - - stock = this_cpu_ptr(&memcg_stock); if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) { stock->nr_bytes -= nr_bytes; ret = true; } - local_irq_restore(flags); + put_obj_stock(flags); return ret; } -static void drain_obj_stock(struct memcg_stock_pcp *stock) +static void drain_obj_stock(struct obj_stock *stock) { struct obj_cgroup *old = stock->cached_objcg; @@ -3148,8 +3189,13 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock, { struct mem_cgroup *memcg; - if (stock->cached_objcg) { - memcg = obj_cgroup_memcg(stock->cached_objcg); + if (in_task() && stock->task_obj.cached_objcg) { + memcg = obj_cgroup_memcg(stock->task_obj.cached_objcg); + if (memcg && mem_cgroup_is_descendant(memcg, root_memcg)) + return true; + } + if (stock->irq_obj.cached_objcg) { + memcg = obj_cgroup_memcg(stock->irq_obj.cached_objcg); if (memcg && mem_cgroup_is_descendant(memcg, root_memcg)) return true; } @@ -3160,13 +3206,10 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock, static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, bool allow_uncharge) { - struct memcg_stock_pcp *stock; unsigned long flags; + struct obj_stock *stock = get_obj_stock(&flags); unsigned int nr_pages = 0; - local_irq_save(flags); - - stock = this_cpu_ptr(&memcg_stock); if (stock->cached_objcg != objcg) { /* reset if necessary */ drain_obj_stock(stock); obj_cgroup_get(objcg); @@ -3182,7 +3225,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, stock->nr_bytes &= (PAGE_SIZE - 1); } - local_irq_restore(flags); + put_obj_stock(flags); if (nr_pages) obj_cgroup_uncharge_pages(objcg, nr_pages); @@ -6790,6 +6833,7 @@ static void uncharge_page(struct page *page, struct uncharge_gather *ug) unsigned long nr_pages; struct mem_cgroup *memcg; struct obj_cgroup *objcg; + bool use_objcg = PageMemcgKmem(page); VM_BUG_ON_PAGE(PageLRU(page), page); @@ -6798,7 +6842,7 @@ static void uncharge_page(struct page *page, struct uncharge_gather *ug) * page memcg or objcg at this point, we have fully * exclusive access to the page. 
*/ - if (PageMemcgKmem(page)) { + if (use_objcg) { objcg = __page_objcg(page); /* * This get matches the put at the end of the function and @@ -6826,7 +6870,7 @@ static void uncharge_page(struct page *page, struct uncharge_gather *ug) nr_pages = compound_nr(page); - if (PageMemcgKmem(page)) { + if (use_objcg) { ug->nr_memory += nr_pages; ug->nr_kmem += nr_pages; -- cgit 1.4.1 From 41eb5df1cbc9b302fc263ad7c9f38cfc38b4df61 Mon Sep 17 00:00:00 2001 From: Waiman Long Date: Mon, 28 Jun 2021 19:37:34 -0700 Subject: mm: memcg/slab: properly set up gfp flags for objcg pointer array Patch series "mm: memcg/slab: Fix objcg pointer array handling problem", v4. Since the merging of the new slab memory controller in v5.9, the page structure stores a pointer to objcg pointer array for slab pages. When the slab has no used objects, it can be freed in free_slab() which will call kfree() to free the objcg pointer array in memcg_alloc_page_obj_cgroups(). If it happens that the objcg pointer array is the last used object in its slab, that slab may then be freed which may caused kfree() to be called again. With the right workload, the slab cache may be set up in a way that allows the recursive kfree() calling loop to nest deep enough to cause a kernel stack overflow and panic the system. In fact, we have a reproducer that can cause kernel stack overflow on a s390 system involving kmalloc-rcl-256 and kmalloc-rcl-128 slabs with the following kfree() loop recursively called 74 times: [ 285.520739] [<000000000ec432fc>] kfree+0x4bc/0x560 [ 285.520740] [<000000000ec43466>] __free_slab+0xc6/0x228 [ 285.520741] [<000000000ec41fc2>] __slab_free+0x3c2/0x3e0 [ 285.520742] [<000000000ec432fc>] kfree+0x4bc/0x560 : While investigating this issue, I also found an issue on the allocation side. If the objcg pointer array happen to come from the same slab or a circular dependency linkage is formed with multiple slabs, those affected slabs can never be freed again. This patch series addresses these two issues by introducing a new set of kmalloc-cg- caches split from kmalloc- caches. The new set will only contain non-reclaimable and non-dma objects that are accounted in memory cgroups whereas the old set are now for unaccounted objects only. By making this split, all the objcg pointer arrays will come from the kmalloc- caches, but those caches will never hold any objcg pointer array. As a result, deeply nested kfree() call and the unfreeable slab problems are now gone. This patch (of 4): Since the merging of the new slab memory controller in v5.9, the page structure may store a pointer to obj_cgroup pointer array for slab pages. Currently, only the __GFP_ACCOUNT bit is masked off. However, the array is not readily reclaimable and doesn't need to come from the DMA buffer. So those GFP bits should be masked off as well. Do the flag bit clearing at memcg_alloc_page_obj_cgroups() to make sure that it is consistently applied no matter where it is called. 
Link: https://lkml.kernel.org/r/20210505200610.13943-1-longman@redhat.com Link: https://lkml.kernel.org/r/20210505200610.13943-2-longman@redhat.com Fixes: 286e04b8ed7a ("mm: memcg/slab: allocate obj_cgroups for non-root slab pages") Signed-off-by: Waiman Long Reviewed-by: Shakeel Butt Acked-by: Roman Gushchin Reviewed-by: Vlastimil Babka Cc: Johannes Weiner Cc: Michal Hocko Cc: Vladimir Davydov Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memcontrol.c | 8 ++++++++ mm/slab.h | 1 - 2 files changed, 8 insertions(+), 1 deletion(-) (limited to 'mm/memcontrol.c') diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 97f76ce04eae..2508bd97349c 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2803,6 +2803,13 @@ retry: } #ifdef CONFIG_MEMCG_KMEM +/* + * The allocated objcg pointers array is not accounted directly. + * Moreover, it should not come from DMA buffer and is not readily + * reclaimable. So those GFP bits should be masked off. + */ +#define OBJCGS_CLEAR_MASK (__GFP_DMA | __GFP_RECLAIMABLE | __GFP_ACCOUNT) + int memcg_alloc_page_obj_cgroups(struct page *page, struct kmem_cache *s, gfp_t gfp, bool new_page) { @@ -2810,6 +2817,7 @@ int memcg_alloc_page_obj_cgroups(struct page *page, struct kmem_cache *s, unsigned long memcg_data; void *vec; + gfp &= ~OBJCGS_CLEAR_MASK; vec = kcalloc_node(objects, sizeof(struct obj_cgroup *), gfp, page_to_nid(page)); if (!vec) diff --git a/mm/slab.h b/mm/slab.h index f2c32f24da95..7b60ef2f32c3 100644 --- a/mm/slab.h +++ b/mm/slab.h @@ -298,7 +298,6 @@ static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s, if (!memcg_kmem_enabled() || !objcg) return; - flags &= ~__GFP_ACCOUNT; for (i = 0; i < size; i++) { if (likely(p[i])) { page = virt_to_head_page(p[i]); -- cgit 1.4.1 From 494c1dfe855ec1f70f89552fce5eadf4a1717552 Mon Sep 17 00:00:00 2001 From: Waiman Long Date: Mon, 28 Jun 2021 19:37:38 -0700 Subject: mm: memcg/slab: create a new set of kmalloc-cg- caches There are currently two problems in the way the objcg pointer array (memcg_data) in the page structure is being allocated and freed. On its allocation, it is possible that the allocated objcg pointer array comes from the same slab that requires memory accounting. If this happens, the slab will never become empty again as there is at least one object left (the obj_cgroup array) in the slab. When it is freed, the objcg pointer array object may be the last one in its slab and hence causes kfree() to be called again. With the right workload, the slab cache may be set up in a way that allows the recursive kfree() calling loop to nest deep enough to cause a kernel stack overflow and panic the system. One way to solve this problem is to split the kmalloc- caches (KMALLOC_NORMAL) into two separate sets - a new set of kmalloc- (KMALLOC_NORMAL) caches for unaccounted objects only and a new set of kmalloc-cg- (KMALLOC_CGROUP) caches for accounted objects only. All the other caches can still allow a mix of accounted and unaccounted objects. With this change, all the objcg pointer array objects will come from KMALLOC_NORMAL caches which won't have their objcg pointer arrays. So both the recursive kfree() problem and non-freeable slab problem are gone. Since both the KMALLOC_NORMAL and KMALLOC_CGROUP caches no longer have mixed accounted and unaccounted objects, this will slightly reduce the number of objcg pointer arrays that need to be allocated and save a bit of memory. 
On the other hand, creating a new set of kmalloc caches does have the effect of reducing cache utilization. So it is properly a wash. The new KMALLOC_CGROUP is added between KMALLOC_NORMAL and KMALLOC_RECLAIM so that the first for loop in create_kmalloc_caches() will include the newly added caches without change. [vbabka@suse.cz: don't create kmalloc-cg caches with cgroup.memory=nokmem] Link: https://lkml.kernel.org/r/20210512145107.6208-1-longman@redhat.com [akpm@linux-foundation.org: un-fat-finger v5 delta creation] [longman@redhat.com: disable cache merging for KMALLOC_NORMAL caches] Link: https://lkml.kernel.org/r/20210505200610.13943-4-longman@redhat.com Link: https://lkml.kernel.org/r/20210512145107.6208-1-longman@redhat.com Link: https://lkml.kernel.org/r/20210505200610.13943-3-longman@redhat.com Signed-off-by: Waiman Long Signed-off-by: Vlastimil Babka Suggested-by: Vlastimil Babka Reviewed-by: Shakeel Butt Acked-by: Roman Gushchin Cc: Christoph Lameter Cc: David Rientjes Cc: Johannes Weiner Cc: Joonsoo Kim Cc: Michal Hocko Cc: Pekka Enberg Cc: Vladimir Davydov [longman@redhat.com: fix for CONFIG_ZONE_DMA=n] Suggested-by: Roman Gushchin Reviewed-by: Vlastimil Babka Cc: Vladimir Davydov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/slab.h | 42 +++++++++++++++++++++++++++++++++--------- mm/internal.h | 5 +++++ mm/memcontrol.c | 2 +- mm/slab_common.c | 32 +++++++++++++++++++++++--------- 4 files changed, 62 insertions(+), 19 deletions(-) (limited to 'mm/memcontrol.c') diff --git a/include/linux/slab.h b/include/linux/slab.h index bc9ab3a5a017..083f3ce550bc 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -305,9 +305,21 @@ static inline void __check_heap_object(const void *ptr, unsigned long n, /* * Whenever changing this, take care of that kmalloc_type() and * create_kmalloc_caches() still work as intended. + * + * KMALLOC_NORMAL can contain only unaccounted objects whereas KMALLOC_CGROUP + * is for accounted but unreclaimable and non-dma objects. All the other + * kmem caches can have both accounted and unaccounted objects. */ enum kmalloc_cache_type { KMALLOC_NORMAL = 0, +#ifndef CONFIG_ZONE_DMA + KMALLOC_DMA = KMALLOC_NORMAL, +#endif +#ifndef CONFIG_MEMCG_KMEM + KMALLOC_CGROUP = KMALLOC_NORMAL, +#else + KMALLOC_CGROUP, +#endif KMALLOC_RECLAIM, #ifdef CONFIG_ZONE_DMA KMALLOC_DMA, @@ -319,24 +331,36 @@ enum kmalloc_cache_type { extern struct kmem_cache * kmalloc_caches[NR_KMALLOC_TYPES][KMALLOC_SHIFT_HIGH + 1]; +/* + * Define gfp bits that should not be set for KMALLOC_NORMAL. + */ +#define KMALLOC_NOT_NORMAL_BITS \ + (__GFP_RECLAIMABLE | \ + (IS_ENABLED(CONFIG_ZONE_DMA) ? __GFP_DMA : 0) | \ + (IS_ENABLED(CONFIG_MEMCG_KMEM) ? __GFP_ACCOUNT : 0)) + static __always_inline enum kmalloc_cache_type kmalloc_type(gfp_t flags) { -#ifdef CONFIG_ZONE_DMA /* * The most common case is KMALLOC_NORMAL, so test for it - * with a single branch for both flags. + * with a single branch for all the relevant flags. */ - if (likely((flags & (__GFP_DMA | __GFP_RECLAIMABLE)) == 0)) + if (likely((flags & KMALLOC_NOT_NORMAL_BITS) == 0)) return KMALLOC_NORMAL; /* - * At least one of the flags has to be set. If both are, __GFP_DMA - * is more important. + * At least one of the flags has to be set. Their priorities in + * decreasing order are: + * 1) __GFP_DMA + * 2) __GFP_RECLAIMABLE + * 3) __GFP_ACCOUNT */ - return flags & __GFP_DMA ? KMALLOC_DMA : KMALLOC_RECLAIM; -#else - return flags & __GFP_RECLAIMABLE ? 
KMALLOC_RECLAIM : KMALLOC_NORMAL; -#endif + if (IS_ENABLED(CONFIG_ZONE_DMA) && (flags & __GFP_DMA)) + return KMALLOC_DMA; + if (!IS_ENABLED(CONFIG_MEMCG_KMEM) || (flags & __GFP_RECLAIMABLE)) + return KMALLOC_RECLAIM; + else + return KMALLOC_CGROUP; } /* diff --git a/mm/internal.h b/mm/internal.h index e8fdb531f887..2946dfa0f245 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -115,6 +115,11 @@ extern void putback_lru_page(struct page *page); */ extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address); +/* + * in mm/memcontrol.c: + */ +extern bool cgroup_memory_nokmem; + /* * in mm/page_alloc.c */ diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 2508bd97349c..b913950b9f64 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -83,7 +83,7 @@ DEFINE_PER_CPU(struct mem_cgroup *, int_active_memcg); static bool cgroup_memory_nosocket; /* Kernel memory accounting disabled? */ -static bool cgroup_memory_nokmem; +bool cgroup_memory_nokmem; /* Whether the swap controller is active */ #ifdef CONFIG_MEMCG_SWAP diff --git a/mm/slab_common.c b/mm/slab_common.c index 6c0db9f9bd8a..db3f356bf725 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -738,21 +738,25 @@ struct kmem_cache *kmalloc_slab(size_t size, gfp_t flags) } #ifdef CONFIG_ZONE_DMA -#define INIT_KMALLOC_INFO(__size, __short_size) \ -{ \ - .name[KMALLOC_NORMAL] = "kmalloc-" #__short_size, \ - .name[KMALLOC_RECLAIM] = "kmalloc-rcl-" #__short_size, \ - .name[KMALLOC_DMA] = "dma-kmalloc-" #__short_size, \ - .size = __size, \ -} +#define KMALLOC_DMA_NAME(sz) .name[KMALLOC_DMA] = "dma-kmalloc-" #sz, +#else +#define KMALLOC_DMA_NAME(sz) +#endif + +#ifdef CONFIG_MEMCG_KMEM +#define KMALLOC_CGROUP_NAME(sz) .name[KMALLOC_CGROUP] = "kmalloc-cg-" #sz, #else +#define KMALLOC_CGROUP_NAME(sz) +#endif + #define INIT_KMALLOC_INFO(__size, __short_size) \ { \ .name[KMALLOC_NORMAL] = "kmalloc-" #__short_size, \ .name[KMALLOC_RECLAIM] = "kmalloc-rcl-" #__short_size, \ + KMALLOC_CGROUP_NAME(__short_size) \ + KMALLOC_DMA_NAME(__short_size) \ .size = __size, \ } -#endif /* * kmalloc_info[] is to make slub_debug=,kmalloc-xx option work at boot time. @@ -838,8 +842,15 @@ void __init setup_kmalloc_cache_index_table(void) static void __init new_kmalloc_cache(int idx, enum kmalloc_cache_type type, slab_flags_t flags) { - if (type == KMALLOC_RECLAIM) + if (type == KMALLOC_RECLAIM) { flags |= SLAB_RECLAIM_ACCOUNT; + } else if (IS_ENABLED(CONFIG_MEMCG_KMEM) && (type == KMALLOC_CGROUP)) { + if (cgroup_memory_nokmem) { + kmalloc_caches[type][idx] = kmalloc_caches[KMALLOC_NORMAL][idx]; + return; + } + flags |= SLAB_ACCOUNT; + } kmalloc_caches[type][idx] = create_kmalloc_cache( kmalloc_info[idx].name[type], @@ -857,6 +868,9 @@ void __init create_kmalloc_caches(slab_flags_t flags) int i; enum kmalloc_cache_type type; + /* + * Including KMALLOC_CGROUP if CONFIG_MEMCG_KMEM defined + */ for (type = KMALLOC_NORMAL; type <= KMALLOC_RECLAIM; type++) { for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++) { if (!kmalloc_caches[type][i]) -- cgit 1.4.1 From c5c8b16b596e15471db22ed8ed10aafbf1a11878 Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Mon, 28 Jun 2021 19:37:44 -0700 Subject: mm: memcontrol: fix root_mem_cgroup charging The below scenario can cause the page counters of the root_mem_cgroup to be out of balance. 
CPU0: CPU1: objcg = get_obj_cgroup_from_current() obj_cgroup_charge_pages(objcg) memcg_reparent_objcgs() // reparent to root_mem_cgroup WRITE_ONCE(iter->memcg, parent) // memcg == root_mem_cgroup memcg = get_mem_cgroup_from_objcg(objcg) // do not charge to the root_mem_cgroup try_charge(memcg) obj_cgroup_uncharge_pages(objcg) memcg = get_mem_cgroup_from_objcg(objcg) // uncharge from the root_mem_cgroup refill_stock(memcg) drain_stock(memcg) page_counter_uncharge(&memcg->memory) get_obj_cgroup_from_current() never returns a root_mem_cgroup's objcg, so we never explicitly charge the root_mem_cgroup. And it's not going to change. It's all about a race when we got an obj_cgroup pointing at some non-root memcg, but before we were able to charge it, the cgroup was gone, objcg was reparented to the root and so we're skipping the charging. Then we store the objcg pointer and later use to uncharge the root_mem_cgroup. This can cause the page counter to be less than the actual value. Although we do not display the value (mem_cgroup_usage) so there shouldn't be any actual problem, but there is a WARN_ON_ONCE in the page_counter_cancel(). Who knows if it will trigger? So it is better to fix it. Link: https://lkml.kernel.org/r/20210425075410.19255-1-songmuchun@bytedance.com Signed-off-by: Muchun Song Acked-by: Roman Gushchin Reviewed-by: Shakeel Butt Cc: Xiongchun Duan Cc: Johannes Weiner Cc: Michal Hocko Cc: Vladimir Davydov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memcontrol.c | 17 ++++++++++++----- 1 file changed, 12 insertions(+), 5 deletions(-) (limited to 'mm/memcontrol.c') diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b913950b9f64..70690fdf53cc 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2568,8 +2568,8 @@ out: css_put(&memcg->css); } -static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, - unsigned int nr_pages) +static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask, + unsigned int nr_pages) { unsigned int batch = max(MEMCG_CHARGE_BATCH, nr_pages); int nr_retries = MAX_RECLAIM_RETRIES; @@ -2581,8 +2581,6 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, bool drained = false; unsigned long pflags; - if (mem_cgroup_is_root(memcg)) - return 0; retry: if (consume_stock(memcg, nr_pages)) return 0; @@ -2762,6 +2760,15 @@ done_restock: return 0; } +static inline int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, + unsigned int nr_pages) +{ + if (mem_cgroup_is_root(memcg)) + return 0; + + return try_charge_memcg(memcg, gfp_mask, nr_pages); +} + #if defined(CONFIG_MEMCG_KMEM) || defined(CONFIG_MMU) static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages) { @@ -2997,7 +3004,7 @@ static int obj_cgroup_charge_pages(struct obj_cgroup *objcg, gfp_t gfp, memcg = get_mem_cgroup_from_objcg(objcg); - ret = try_charge(memcg, gfp, nr_pages); + ret = try_charge_memcg(memcg, gfp, nr_pages); if (ret) goto out; -- cgit 1.4.1 From 8dc87c7d1fec8851925ca96ade0d65d3dcf89cce Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Mon, 28 Jun 2021 19:37:47 -0700 Subject: mm: memcontrol: fix page charging in page replacement Patch series "memcontrol code cleanup and simplification", v3. This patch (of 8): The pages aren't accounted at the root level, so do not charge the page to the root memcg in page replacement. Although we do not display the value (mem_cgroup_usage) so there shouldn't be any actual problem, but there is a WARN_ON_ONCE in the page_counter_cancel(). Who knows if it will trigger? So it is better to fix it. 
Link: https://lkml.kernel.org/r/20210417043538.9793-1-songmuchun@bytedance.com Link: https://lkml.kernel.org/r/20210417043538.9793-2-songmuchun@bytedance.com Signed-off-by: Muchun Song Acked-by: Johannes Weiner Reviewed-by: Shakeel Butt Acked-by: Roman Gushchin Acked-by: Michal Hocko Cc: Vladimir Davydov Cc: Xiongchun Duan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memcontrol.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) (limited to 'mm/memcontrol.c') diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 70690fdf53cc..239f69ed1ac1 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -6984,9 +6984,11 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage) /* Force-charge the new page. The old one will be freed soon */ nr_pages = thp_nr_pages(newpage); - page_counter_charge(&memcg->memory, nr_pages); - if (do_memsw_account()) - page_counter_charge(&memcg->memsw, nr_pages); + if (!mem_cgroup_is_root(memcg)) { + page_counter_charge(&memcg->memory, nr_pages); + if (do_memsw_account()) + page_counter_charge(&memcg->memsw, nr_pages); + } css_get(&memcg->css); commit_charge(newpage, memcg); -- cgit 1.4.1 From 2884b6b7eed4fc14c0630fb16e56a4c66c786d33 Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Mon, 28 Jun 2021 19:37:50 -0700 Subject: mm: memcontrol: bail out early when !mm in get_mem_cgroup_from_mm When mm is NULL, we do not need to hold rcu lock and call css_tryget for the root memcg. And we also do not need to check !mm in every loop of while. So bail out early when !mm. Link: https://lkml.kernel.org/r/20210417043538.9793-3-songmuchun@bytedance.com Signed-off-by: Muchun Song Acked-by: Johannes Weiner Reviewed-by: Shakeel Butt Acked-by: Roman Gushchin Acked-by: Michal Hocko Cc: Vladimir Davydov Cc: Xiongchun Duan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memcontrol.c | 25 ++++++++++++++----------- 1 file changed, 14 insertions(+), 11 deletions(-) (limited to 'mm/memcontrol.c') diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 239f69ed1ac1..babbaf49ee36 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -919,20 +919,23 @@ struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm) if (mem_cgroup_disabled()) return NULL; + /* + * Page cache insertions can happen without an + * actual mm context, e.g. during disk probing + * on boot, loopback IO, acct() writes etc. + * + * No need to css_get on root memcg as the reference + * counting is disabled on the root level in the + * cgroup core. See CSS_NO_REF. + */ + if (unlikely(!mm)) + return root_mem_cgroup; + rcu_read_lock(); do { - /* - * Page cache insertions can happen without an - * actual mm context, e.g. during disk probing - * on boot, loopback IO, acct() writes etc. - */ - if (unlikely(!mm)) + memcg = mem_cgroup_from_task(rcu_dereference(mm->owner)); + if (unlikely(!memcg)) memcg = root_mem_cgroup; - else { - memcg = mem_cgroup_from_task(rcu_dereference(mm->owner)); - if (unlikely(!memcg)) - memcg = root_mem_cgroup; - } } while (!css_tryget(&memcg->css)); rcu_read_unlock(); return memcg; -- cgit 1.4.1 From a984226f457f849eb9c4ce727eeaa3b5080597d8 Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Mon, 28 Jun 2021 19:37:53 -0700 Subject: mm: memcontrol: remove the pgdata parameter of mem_cgroup_page_lruvec All the callers of mem_cgroup_page_lruvec() just pass page_pgdat(page) as the 2nd parameter to it (except isolate_migratepages_block()). But for isolate_migratepages_block(), the page_pgdat(page) is also equal to the local variable of @pgdat. 
So mem_cgroup_page_lruvec() do not need the pgdat parameter. Just remove it to simplify the code. Link: https://lkml.kernel.org/r/20210417043538.9793-4-songmuchun@bytedance.com Signed-off-by: Muchun Song Acked-by: Johannes Weiner Reviewed-by: Shakeel Butt Acked-by: Roman Gushchin Acked-by: Michal Hocko Cc: Vladimir Davydov Cc: Xiongchun Duan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memcontrol.h | 10 +++++----- mm/compaction.c | 2 +- mm/memcontrol.c | 9 +++------ mm/swap.c | 2 +- mm/workingset.c | 2 +- 5 files changed, 11 insertions(+), 14 deletions(-) (limited to 'mm/memcontrol.c') diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index c193be760709..f2a5aaba3577 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -743,13 +743,12 @@ out: /** * mem_cgroup_page_lruvec - return lruvec for isolating/putting an LRU page * @page: the page - * @pgdat: pgdat of the page * * This function relies on page->mem_cgroup being stable. */ -static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page, - struct pglist_data *pgdat) +static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page) { + pg_data_t *pgdat = page_pgdat(page); struct mem_cgroup *memcg = page_memcg(page); VM_WARN_ON_ONCE_PAGE(!memcg && !mem_cgroup_disabled(), page); @@ -1221,9 +1220,10 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg, return &pgdat->__lruvec; } -static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page, - struct pglist_data *pgdat) +static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page) { + pg_data_t *pgdat = page_pgdat(page); + return &pgdat->__lruvec; } diff --git a/mm/compaction.c b/mm/compaction.c index 84fde270ae74..7d41b58fb17c 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -1028,7 +1028,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, if (!TestClearPageLRU(page)) goto isolate_fail_put; - lruvec = mem_cgroup_page_lruvec(page, pgdat); + lruvec = mem_cgroup_page_lruvec(page); /* If we already hold the lock, we can skip some rechecking */ if (lruvec != locked) { diff --git a/mm/memcontrol.c b/mm/memcontrol.c index babbaf49ee36..946a9a483e71 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1199,9 +1199,8 @@ void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page) struct lruvec *lock_page_lruvec(struct page *page) { struct lruvec *lruvec; - struct pglist_data *pgdat = page_pgdat(page); - lruvec = mem_cgroup_page_lruvec(page, pgdat); + lruvec = mem_cgroup_page_lruvec(page); spin_lock(&lruvec->lru_lock); lruvec_memcg_debug(lruvec, page); @@ -1212,9 +1211,8 @@ struct lruvec *lock_page_lruvec(struct page *page) struct lruvec *lock_page_lruvec_irq(struct page *page) { struct lruvec *lruvec; - struct pglist_data *pgdat = page_pgdat(page); - lruvec = mem_cgroup_page_lruvec(page, pgdat); + lruvec = mem_cgroup_page_lruvec(page); spin_lock_irq(&lruvec->lru_lock); lruvec_memcg_debug(lruvec, page); @@ -1225,9 +1223,8 @@ struct lruvec *lock_page_lruvec_irq(struct page *page) struct lruvec *lock_page_lruvec_irqsave(struct page *page, unsigned long *flags) { struct lruvec *lruvec; - struct pglist_data *pgdat = page_pgdat(page); - lruvec = mem_cgroup_page_lruvec(page, pgdat); + lruvec = mem_cgroup_page_lruvec(page); spin_lock_irqsave(&lruvec->lru_lock, *flags); lruvec_memcg_debug(lruvec, page); diff --git a/mm/swap.c b/mm/swap.c index dfb48cf9c2c9..18cc9e63515b 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -313,7 +313,7 @@ void 
lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages) void lru_note_cost_page(struct page *page) { - lru_note_cost(mem_cgroup_page_lruvec(page, page_pgdat(page)), + lru_note_cost(mem_cgroup_page_lruvec(page), page_is_file_lru(page), thp_nr_pages(page)); } diff --git a/mm/workingset.c b/mm/workingset.c index b7cdeca5a76d..4f7a306ce75a 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -408,7 +408,7 @@ void workingset_activation(struct page *page) memcg = page_memcg_rcu(page); if (!mem_cgroup_disabled() && !memcg) goto out; - lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page)); + lruvec = mem_cgroup_page_lruvec(page); workingset_age_nonresident(lruvec, thp_nr_pages(page)); out: rcu_read_unlock(); -- cgit 1.4.1 From 9838354e16a2a920d5a228559850d10fa588a18d Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Mon, 28 Jun 2021 19:38:03 -0700 Subject: mm: memcontrol: simplify the logic of objcg pinning memcg The obj_cgroup_release() and memcg_reparent_objcgs() are serialized by the css_set_lock. We do not need to care about objcg->memcg being released in the process of obj_cgroup_release(). So there is no need to pin memcg before releasing objcg. Remove those pinning logic to simplfy the code. There are only two places that modifies the objcg->memcg. One is the initialization to objcg->memcg in the memcg_online_kmem(), another is objcgs reparenting in the memcg_reparent_objcgs(). It is also impossible for the two to run in parallel. So xchg() is unnecessary and it is enough to use WRITE_ONCE(). Link: https://lkml.kernel.org/r/20210417043538.9793-7-songmuchun@bytedance.com Signed-off-by: Muchun Song Acked-by: Johannes Weiner Reviewed-by: Shakeel Butt Acked-by: Roman Gushchin Cc: Michal Hocko Cc: Vladimir Davydov Cc: Xiongchun Duan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memcontrol.c | 20 ++++++-------------- 1 file changed, 6 insertions(+), 14 deletions(-) (limited to 'mm/memcontrol.c') diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 946a9a483e71..c79b6926fe83 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -261,7 +261,6 @@ static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg, static void obj_cgroup_release(struct percpu_ref *ref) { struct obj_cgroup *objcg = container_of(ref, struct obj_cgroup, refcnt); - struct mem_cgroup *memcg; unsigned int nr_bytes; unsigned int nr_pages; unsigned long flags; @@ -291,11 +290,9 @@ static void obj_cgroup_release(struct percpu_ref *ref) nr_pages = nr_bytes >> PAGE_SHIFT; spin_lock_irqsave(&css_set_lock, flags); - memcg = obj_cgroup_memcg(objcg); if (nr_pages) obj_cgroup_uncharge_pages(objcg, nr_pages); list_del(&objcg->list); - mem_cgroup_put(memcg); spin_unlock_irqrestore(&css_set_lock, flags); percpu_ref_exit(ref); @@ -330,17 +327,12 @@ static void memcg_reparent_objcgs(struct mem_cgroup *memcg, spin_lock_irq(&css_set_lock); - /* Move active objcg to the parent's list */ - xchg(&objcg->memcg, parent); - css_get(&parent->css); - list_add(&objcg->list, &parent->objcg_list); - - /* Move already reparented objcgs to the parent's list */ - list_for_each_entry(iter, &memcg->objcg_list, list) { - css_get(&parent->css); - xchg(&iter->memcg, parent); - css_put(&memcg->css); - } + /* 1) Ready to reparent active objcg. */ + list_add(&objcg->list, &memcg->objcg_list); + /* 2) Reparent active objcg and already reparented objcgs to parent. 
*/ + list_for_each_entry(iter, &memcg->objcg_list, list) + WRITE_ONCE(iter->memcg, parent); + /* 3) Move already reparented objcgs to the parent's list */ list_splice(&memcg->objcg_list, &parent->objcg_list); spin_unlock_irq(&css_set_lock); -- cgit 1.4.1 From 271dd6b1f636a99a3a77889935296c063f5a3cbe Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Mon, 28 Jun 2021 19:38:06 -0700 Subject: mm: memcontrol: move obj_cgroup_uncharge_pages() out of css_set_lock The css_set_lock is used to guard the list of inherited objcgs. So there is no need to uncharge kernel memory under css_set_lock. Just move it out of the lock. Link: https://lkml.kernel.org/r/20210417043538.9793-8-songmuchun@bytedance.com Signed-off-by: Muchun Song Reviewed-by: Shakeel Butt Acked-by: Roman Gushchin Acked-by: Johannes Weiner Cc: Michal Hocko Cc: Vladimir Davydov Cc: Xiongchun Duan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memcontrol.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'mm/memcontrol.c') diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c79b6926fe83..f7a552eb3e9d 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -289,9 +289,10 @@ static void obj_cgroup_release(struct percpu_ref *ref) WARN_ON_ONCE(nr_bytes & (PAGE_SIZE - 1)); nr_pages = nr_bytes >> PAGE_SHIFT; - spin_lock_irqsave(&css_set_lock, flags); if (nr_pages) obj_cgroup_uncharge_pages(objcg, nr_pages); + + spin_lock_irqsave(&css_set_lock, flags); list_del(&objcg->list); spin_unlock_irqrestore(&css_set_lock, flags); -- cgit 1.4.1 From 04f94e3fbe1afcb815d7c7ace78c6779772aa837 Mon Sep 17 00:00:00 2001 From: Dan Schatzberg Date: Mon, 28 Jun 2021 19:38:18 -0700 Subject: mm: charge active memcg when no mm is set MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit set_active_memcg() worked for kernel allocations but was silently ignored for user pages. This patch establishes a precedence order for who gets charged: 1. If there is a memcg associated with the page already, that memcg is charged. This happens during swapin. 2. If an explicit mm is passed, mm->memcg is charged. This happens during page faults, which can be triggered in remote VMs (eg gup). 3. Otherwise consult the current process context. If there is an active_memcg, use that. Otherwise, current->mm->memcg. Previously, if a NULL mm was passed to mem_cgroup_charge (case 3) it would always charge the root cgroup. Now it looks up the active_memcg first (falling back to charging the root cgroup if not set). 
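Below is a minimal sketch of how a kernel user can rely on the new
precedence order: it charges a user page to a remote memcg by setting the
active memcg and passing a NULL mm, which is essentially what the loop
driver does in the next patch.  The wrapper function name, the GFP flags
and the omitted error handling around the memcg reference are assumptions
for illustration.

  #include <linux/memcontrol.h>
  #include <linux/sched/mm.h>

  /* Sketch: charge @page to @memcg instead of current->mm->memcg. */
  static int charge_page_to_remote_memcg(struct page *page,
                                         struct mem_cgroup *memcg)
  {
          struct mem_cgroup *old_memcg;
          int err;

          old_memcg = set_active_memcg(memcg);    /* precedence case 3: active memcg */
          err = mem_cgroup_charge(page, NULL, GFP_KERNEL);  /* NULL mm */
          set_active_memcg(old_memcg);

          return err;
  }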
Link: https://lkml.kernel.org/r/20210610173944.1203706-3-schatzberg.dan@gmail.com Signed-off-by: Dan Schatzberg Acked-by: Johannes Weiner Acked-by: Tejun Heo Acked-by: Chris Down Acked-by: Jens Axboe Reviewed-by: Shakeel Butt Reviewed-by: Michal Koutný Cc: Michal Hocko Cc: Ming Lei Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/filemap.c | 2 +- mm/memcontrol.c | 41 +++++++++++++++++++++++++++-------------- mm/shmem.c | 4 ++-- 3 files changed, 30 insertions(+), 17 deletions(-) (limited to 'mm/memcontrol.c') diff --git a/mm/filemap.c b/mm/filemap.c index 66f7e9fdfbc4..ac82a93d4f38 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -872,7 +872,7 @@ noinline int __add_to_page_cache_locked(struct page *page, page->index = offset; if (!huge) { - error = mem_cgroup_charge(page, current->mm, gfp); + error = mem_cgroup_charge(page, NULL, gfp); if (error) goto error; charged = true; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index f7a552eb3e9d..8f3244e59b30 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -897,13 +897,24 @@ struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p) } EXPORT_SYMBOL(mem_cgroup_from_task); +static __always_inline struct mem_cgroup *active_memcg(void) +{ + if (in_interrupt()) + return this_cpu_read(int_active_memcg); + else + return current->active_memcg; +} + /** * get_mem_cgroup_from_mm: Obtain a reference on given mm_struct's memcg. * @mm: mm from which memcg should be extracted. It can be NULL. * - * Obtain a reference on mm->memcg and returns it if successful. Otherwise - * root_mem_cgroup is returned. However if mem_cgroup is disabled, NULL is - * returned. + * Obtain a reference on mm->memcg and returns it if successful. If mm + * is NULL, then the memcg is chosen as follows: + * 1) The active memcg, if set. + * 2) current->mm->memcg, if available + * 3) root memcg + * If mem_cgroup is disabled, NULL is returned. */ struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm) { @@ -921,8 +932,17 @@ struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm) * counting is disabled on the root level in the * cgroup core. See CSS_NO_REF. */ - if (unlikely(!mm)) - return root_mem_cgroup; + if (unlikely(!mm)) { + memcg = active_memcg(); + if (unlikely(memcg)) { + /* remote memcg must hold a ref */ + css_get(&memcg->css); + return memcg; + } + mm = current->mm; + if (unlikely(!mm)) + return root_mem_cgroup; + } rcu_read_lock(); do { @@ -935,14 +955,6 @@ struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm) } EXPORT_SYMBOL(get_mem_cgroup_from_mm); -static __always_inline struct mem_cgroup *active_memcg(void) -{ - if (in_interrupt()) - return this_cpu_read(int_active_memcg); - else - return current->active_memcg; -} - static __always_inline bool memcg_kmem_bypass(void) { /* Allow remote memcg charging from any context. */ @@ -6711,7 +6723,8 @@ out: * @gfp_mask: reclaim mode * * Try to charge @page to the memcg that @mm belongs to, reclaiming - * pages according to @gfp_mask if necessary. + * pages according to @gfp_mask if necessary. if @mm is NULL, try to + * charge to the active memcg. * * Do not use this for pages allocated for swapin. * diff --git a/mm/shmem.c b/mm/shmem.c index 53f21016608e..e72931b9246c 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1695,7 +1695,7 @@ static int shmem_swapin_page(struct inode *inode, pgoff_t index, { struct address_space *mapping = inode->i_mapping; struct shmem_inode_info *info = SHMEM_I(inode); - struct mm_struct *charge_mm = vma ? 
vma->vm_mm : current->mm; + struct mm_struct *charge_mm = vma ? vma->vm_mm : NULL; struct swap_info_struct *si; struct page *page = NULL; swp_entry_t swap; @@ -1828,7 +1828,7 @@ repeat: } sbinfo = SHMEM_SB(inode->i_sb); - charge_mm = vma ? vma->vm_mm : current->mm; + charge_mm = vma ? vma->vm_mm : NULL; page = pagecache_get_page(mapping, index, FGP_ENTRY | FGP_HEAD | FGP_LOCK, 0); -- cgit 1.4.1 From c74d40e8b5e2ac5eee1ca45b12d3e174915f1d88 Mon Sep 17 00:00:00 2001 From: Dan Schatzberg Date: Mon, 28 Jun 2021 19:38:21 -0700 Subject: loop: charge i/o to mem and blk cg The current code only associates with the existing blkcg when aio is used to access the backing file. This patch covers all types of i/o to the backing file and also associates the memcg so if the backing file is on tmpfs, memory is charged appropriately. This patch also exports cgroup_get_e_css and int_active_memcg so it can be used by the loop module. Link: https://lkml.kernel.org/r/20210610173944.1203706-4-schatzberg.dan@gmail.com Signed-off-by: Dan Schatzberg Acked-by: Johannes Weiner Acked-by: Jens Axboe Cc: Chris Down Cc: Michal Hocko Cc: Ming Lei Cc: Shakeel Butt Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- drivers/block/loop.c | 61 +++++++++++++++++++++++++++++++--------------- drivers/block/loop.h | 3 ++- include/linux/memcontrol.h | 6 +++++ kernel/cgroup/cgroup.c | 1 + mm/memcontrol.c | 1 + 5 files changed, 51 insertions(+), 21 deletions(-) (limited to 'mm/memcontrol.c') diff --git a/drivers/block/loop.c b/drivers/block/loop.c index 54ed3ebbbc37..452c7437e1f0 100644 --- a/drivers/block/loop.c +++ b/drivers/block/loop.c @@ -78,6 +78,7 @@ #include #include #include +#include #include "loop.h" @@ -516,8 +517,6 @@ static void lo_rw_aio_complete(struct kiocb *iocb, long ret, long ret2) { struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb); - if (cmd->css) - css_put(cmd->css); cmd->ret = ret; lo_rw_aio_do_completion(cmd); } @@ -578,8 +577,6 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd, cmd->iocb.ki_complete = lo_rw_aio_complete; cmd->iocb.ki_flags = IOCB_DIRECT; cmd->iocb.ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0); - if (cmd->css) - kthread_associate_blkcg(cmd->css); if (rw == WRITE) ret = call_write_iter(file, &cmd->iocb, &iter); @@ -587,7 +584,6 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd, ret = call_read_iter(file, &cmd->iocb, &iter); lo_rw_aio_do_completion(cmd); - kthread_associate_blkcg(NULL); if (ret != -EIOCBQUEUED) cmd->iocb.ki_complete(&cmd->iocb, ret, 0); @@ -928,7 +924,7 @@ struct loop_worker { struct list_head cmd_list; struct list_head idle_list; struct loop_device *lo; - struct cgroup_subsys_state *css; + struct cgroup_subsys_state *blkcg_css; unsigned long last_ran_at; }; @@ -957,7 +953,7 @@ static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd) spin_lock_irq(&lo->lo_work_lock); - if (queue_on_root_worker(cmd->css)) + if (queue_on_root_worker(cmd->blkcg_css)) goto queue_work; node = &lo->worker_tree.rb_node; @@ -965,10 +961,10 @@ static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd) while (*node) { parent = *node; cur_worker = container_of(*node, struct loop_worker, rb_node); - if (cur_worker->css == cmd->css) { + if (cur_worker->blkcg_css == cmd->blkcg_css) { worker = cur_worker; break; - } else if ((long)cur_worker->css < (long)cmd->css) { + } else if ((long)cur_worker->blkcg_css < (long)cmd->blkcg_css) { node = &(*node)->rb_left; } else { node = &(*node)->rb_right; 
@@ -980,13 +976,18 @@ static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd) worker = kzalloc(sizeof(struct loop_worker), GFP_NOWAIT | __GFP_NOWARN); /* * In the event we cannot allocate a worker, just queue on the - * rootcg worker + * rootcg worker and issue the I/O as the rootcg */ - if (!worker) + if (!worker) { + cmd->blkcg_css = NULL; + if (cmd->memcg_css) + css_put(cmd->memcg_css); + cmd->memcg_css = NULL; goto queue_work; + } - worker->css = cmd->css; - css_get(worker->css); + worker->blkcg_css = cmd->blkcg_css; + css_get(worker->blkcg_css); INIT_WORK(&worker->work, loop_workfn); INIT_LIST_HEAD(&worker->cmd_list); INIT_LIST_HEAD(&worker->idle_list); @@ -1306,7 +1307,7 @@ static int __loop_clr_fd(struct loop_device *lo, bool release) idle_list) { list_del(&worker->idle_list); rb_erase(&worker->rb_node, &lo->worker_tree); - css_put(worker->css); + css_put(worker->blkcg_css); kfree(worker); } spin_unlock_irq(&lo->lo_work_lock); @@ -2100,13 +2101,18 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx, } /* always use the first bio's css */ + cmd->blkcg_css = NULL; + cmd->memcg_css = NULL; #ifdef CONFIG_BLK_CGROUP - if (cmd->use_aio && rq->bio && rq->bio->bi_blkg) { - cmd->css = &bio_blkcg(rq->bio)->css; - css_get(cmd->css); - } else + if (rq->bio && rq->bio->bi_blkg) { + cmd->blkcg_css = &bio_blkcg(rq->bio)->css; +#ifdef CONFIG_MEMCG + cmd->memcg_css = + cgroup_get_e_css(cmd->blkcg_css->cgroup, + &memory_cgrp_subsys); +#endif + } #endif - cmd->css = NULL; loop_queue_work(lo, cmd); return BLK_STS_OK; @@ -2118,13 +2124,28 @@ static void loop_handle_cmd(struct loop_cmd *cmd) const bool write = op_is_write(req_op(rq)); struct loop_device *lo = rq->q->queuedata; int ret = 0; + struct mem_cgroup *old_memcg = NULL; if (write && (lo->lo_flags & LO_FLAGS_READ_ONLY)) { ret = -EIO; goto failed; } + if (cmd->blkcg_css) + kthread_associate_blkcg(cmd->blkcg_css); + if (cmd->memcg_css) + old_memcg = set_active_memcg( + mem_cgroup_from_css(cmd->memcg_css)); + ret = do_req_filebacked(lo, rq); + + if (cmd->blkcg_css) + kthread_associate_blkcg(NULL); + + if (cmd->memcg_css) { + set_active_memcg(old_memcg); + css_put(cmd->memcg_css); + } failed: /* complete non-aio request */ if (!cmd->use_aio || ret) { @@ -2203,7 +2224,7 @@ static void loop_free_idle_workers(struct timer_list *timer) break; list_del(&worker->idle_list); rb_erase(&worker->rb_node, &lo->worker_tree); - css_put(worker->css); + css_put(worker->blkcg_css); kfree(worker); } if (!list_empty(&lo->idle_worker_list)) diff --git a/drivers/block/loop.h b/drivers/block/loop.h index f81c01bde5c0..1988899db63a 100644 --- a/drivers/block/loop.h +++ b/drivers/block/loop.h @@ -77,7 +77,8 @@ struct loop_cmd { long ret; struct kiocb iocb; struct bio_vec *bvec; - struct cgroup_subsys_state *css; + struct cgroup_subsys_state *blkcg_css; + struct cgroup_subsys_state *memcg_css; }; /* Support for loadable transfer modules */ diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 3cc18c2176e7..1de3859233a6 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -1230,6 +1230,12 @@ static inline struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm) return NULL; } +static inline +struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css) +{ + return NULL; +} + static inline void mem_cgroup_put(struct mem_cgroup *memcg) { } diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 21ecc6ee6a6d..9cc8c3a686b1 100644 --- a/kernel/cgroup/cgroup.c +++ 
b/kernel/cgroup/cgroup.c @@ -577,6 +577,7 @@ out_unlock: rcu_read_unlock(); return css; } +EXPORT_SYMBOL_GPL(cgroup_get_e_css); static void cgroup_get_live(struct cgroup *cgrp) { diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 8f3244e59b30..4ee243ce6135 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -78,6 +78,7 @@ struct mem_cgroup *root_mem_cgroup __read_mostly; /* Active memory cgroup to use from an interrupt context */ DEFINE_PER_CPU(struct mem_cgroup *, int_active_memcg); +EXPORT_PER_CPU_SYMBOL_GPL(int_active_memcg); /* Socket memory accounting disabled? */ static bool cgroup_memory_nosocket; -- cgit 1.4.1
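Taken together, the loop changes above follow one pattern: capture the submitting context's blkcg and memcg css pointers at queue time, then have the worker thread associate with both around the backing-file I/O. A condensed sketch of that bracketing (illustrative only; the function and its callback parameters are hypothetical, mirroring loop_handle_cmd() rather than reproducing it) looks like this:

/*
 * Illustrative sketch, not the driver code: issue backing-file I/O on
 * behalf of the cgroups captured at submission time.  @blkcg_css and
 * @memcg_css are assumed to have been looked up and referenced earlier
 * (as loop_queue_rq() does via bio_blkcg() and cgroup_get_e_css()).
 */
#include <linux/cgroup.h>
#include <linux/kthread.h>
#include <linux/memcontrol.h>
#include <linux/sched/mm.h>

static void issue_io_as_cgroup(struct cgroup_subsys_state *blkcg_css,
			       struct cgroup_subsys_state *memcg_css,
			       void (*do_io)(void *arg), void *arg)
{
	struct mem_cgroup *old_memcg = NULL;

	if (blkcg_css)
		kthread_associate_blkcg(blkcg_css);	/* block I/O accounted to blkcg */
	if (memcg_css)
		old_memcg = set_active_memcg(mem_cgroup_from_css(memcg_css));

	do_io(arg);			/* e.g. reads/writes into a tmpfs backing file */

	if (blkcg_css)
		kthread_associate_blkcg(NULL);
	if (memcg_css) {
		set_active_memcg(old_memcg);	/* restore the previous memcg */
		css_put(memcg_css);		/* drop the ref taken at queue time */
	}
}

With the active memcg set, any page-cache pages the tmpfs backing file allocates are charged to the originating memcg through the NULL-mm path added in the previous patch, rather than to the loop worker's root cgroup.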