path: root/fs
2022-03-18  netfs: Split some core bits out into their own file  (David Howells)
Split some core bits out into their own file. More bits will be added to this file later. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> cc: linux-cachefs@redhat.com Link: https://lore.kernel.org/r/164623006934.3564931.17932680017894039748.stgit@warthog.procyon.org.uk/ # v1 Link: https://lore.kernel.org/r/164678218407.1200972.1731208226140990280.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/164692920944.2099075.11990502173226013856.stgit@warthog.procyon.org.uk/ # v3
2022-03-18  netfs: Split fs/netfs/read_helper.c  (David Howells)

Split fs/netfs/read_helper.c into two pieces, one to deal with buffered writes and one to deal with the I/O mechanism.

Changes
=======
ver #2)
 - Add kdoc reference to new file.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/164623005586.3564931.6149556072728481767.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/164678217075.1200972.5101072043126828757.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/164692919953.2099075.7156989585513833046.stgit@warthog.procyon.org.uk/ # v3
2022-03-18  netfs: Rename read_helper.c to io.c  (David Howells)

Rename the read_helper.c file to io.c before splitting out the buffered read functions and some other bits.

Changes
=======
ver #2)
 - Rename read_helper.c before splitting.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/164678216109.1200972.16567696909952495832.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/164692918076.2099075.8120961172717347610.stgit@warthog.procyon.org.uk/ # v3
2022-03-18  netfs: Prepare to split read_helper.c  (David Howells)

Rename netfs_rreq_unlock() to netfs_rreq_unlock_folios() to make it sound less like it's dropping a lock on a netfs_io_request struct. Remove the 'static' marker from netfs_rreq_unlock_folios() and declare it in internal.h in preparation for splitting the file.

Changes
=======
ver #2)
 - Slide this patch to after the one adding netfs_begin_read().
 - As a consequence, don't need to unstatic so many functions.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/164623002861.3564931.17340149482236413375.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/164678215208.1200972.9761906209395002182.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/164692912709.2099075.4349905992838317797.stgit@warthog.procyon.org.uk/ # v3
2022-03-18  netfs: Add a function to consolidate beginning a read  (David Howells)

Add a function to do the steps needed to begin a read request, allowing this code to be removed from several other functions and consolidated.

Changes
=======
ver #2)
 - Move before the unstaticking patch so that some functions can be left static.
 - Set uninitialised return code in netfs_begin_read()[1][2].
 - Fixed a refleak caused by non-removal of a get from netfs_write_begin() when the request submission code got moved to netfs_begin_read().
 - Use INIT_WORK() to (re-)init the request work_struct[3].

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/20220303163826.1120936-1-nathan@kernel.org/ [1]
Link: https://lore.kernel.org/r/20220303235647.1297171-1-colin.i.king@gmail.com/ [2]
Link: https://lore.kernel.org/r/9d69be49081bccff44260e4c6e0049c63d6d04a1.camel@redhat.com/ [3]
Link: https://lore.kernel.org/r/164623004355.3564931.7275693529042495641.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/164678214287.1200972.16734134007649832160.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/164692911113.2099075.1060868473229451371.stgit@warthog.procyon.org.uk/ # v3
2022-03-18  netfs: Add a netfs inode context  (David Howells)

Add a netfs_i_context struct that should be included in the network filesystem's own inode struct wrapper, directly after the VFS's inode struct, e.g.:

	struct my_inode {
		struct {			/* These must be contiguous */
			struct inode		vfs_inode;
			struct netfs_i_context	netfs_ctx;
		};
	};

The netfs_i_context struct so far contains a single field for the network filesystem to use - the cache cookie:

	struct netfs_i_context {
		...
		struct fscache_cookie	*cache;
	};

Three functions are provided to help with this:

 (1) void netfs_i_context_init(struct inode *inode,
				const struct netfs_request_ops *ops);

     Initialise the netfs context and set the operations.

 (2) struct netfs_i_context *netfs_i_context(struct inode *inode);

     Find the netfs context from the VFS inode.

 (3) struct inode *netfs_inode(struct netfs_i_context *ctx);

     Find the VFS inode from the netfs context.

Changes
=======
ver #4)
 - Fix netfs_is_cache_enabled() to check cookie->cache_priv to see if a cache is present[3].
 - Fix netfs_skip_folio_read() to zero out all of the page, not just some of it[3].

ver #3)
 - Split out the bit to move ceph cap-getting on readahead into ceph_init_request()[1].
 - Stick in a comment to the netfs inode structs indicating the contiguity requirements[2].

ver #2)
 - Adjust documentation to match.
 - Use "#if IS_ENABLED()" in netfs_i_cookie(), not "#ifdef".
 - Move the cap check from ceph_readahead() to ceph_init_request() to be called from netfslib.
 - Remove ceph_readahead() and use netfs_readahead() directly instead.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/8af0d47f17d89c06bbf602496dd845f2b0bf25b3.camel@kernel.org/ [1]
Link: https://lore.kernel.org/r/beaf4f6a6c2575ed489adb14b257253c868f9a5c.camel@kernel.org/ [2]
Link: https://lore.kernel.org/r/3536452.1647421585@warthog.procyon.org.uk/ [3]
Link: https://lore.kernel.org/r/164622984545.3564931.15691742939278418580.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/164678213320.1200972.16807551936267647470.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/164692909854.2099075.9535537286264248057.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/306388.1647595110@warthog.procyon.org.uk/ # v4
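As a rough illustration of the embedding pattern this commit describes, a minimal sketch of how a filesystem might wire it up is shown below; the "myfs" names and the request-ops table are invented for the example, only the netfs_i_context helpers come from the commit message.

	#include <linux/fs.h>
	#include <linux/netfs.h>

	/* Hypothetical inode wrapper embedding the netfs context. */
	struct myfs_inode {
		struct {				/* These must be contiguous */
			struct inode		vfs_inode;
			struct netfs_i_context	netfs_ctx;
		};
		/* ... myfs-private fields ... */
	};

	static const struct netfs_request_ops myfs_req_ops;	/* assumed to exist */

	/* Wire up the context when the inode is initialised. */
	static void myfs_init_netfs_context(struct inode *inode,
					    struct fscache_cookie *cookie)
	{
		netfs_i_context_init(inode, &myfs_req_ops);	/* set the ops */
		netfs_i_context(inode)->cache = cookie;		/* stash the cookie */
	}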
2022-03-18  ceph: Make ceph_init_request() check caps on readahead  (David Howells)

Move the caps check from ceph_readahead() to ceph_init_request(), conditional on the origin being NETFS_READAHEAD so that in a future patch, ceph can point its ->readahead() vector directly at netfs_readahead().

Changes
=======
ver #4)
 - Move the check for NETFS_READAHEAD up in ceph_init_request()[2].

ver #3)
 - Split from the patch to add a netfs inode context[1].
 - Need to store the caps got in rreq->netfs_priv for later freeing.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: ceph-devel@vger.kernel.org
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/8af0d47f17d89c06bbf602496dd845f2b0bf25b3.camel@kernel.org/ [1]
Link: https://lore.kernel.org/r/dd054c962818716e718bd9b446ee5322ca097675.camel@redhat.com/ [2]
Link: https://lore.kernel.org/r/164692907694.2099075.10081819855690054094.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/2533821.1647006574@warthog.procyon.org.uk/ # v4
2022-03-18  netfs: Change ->init_request() to return an error code  (David Howells)
Change the request initialisation function to return an error code so that the network filesystem can return a failure (ENOMEM, for example). This will also allow ceph to abort a ->readahead() op if the server refuses to give it a cap allowing local caching from within the netfslib framework (errors aren't passed back through ->readahead(), so returning, say, -ENOBUFS will cause the op to be aborted). Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> cc: linux-cachefs@redhat.com Link: https://lore.kernel.org/r/164678212401.1200972.16537041523832944934.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/164692905398.2099075.5238033621684646524.stgit@warthog.procyon.org.uk/ # v3
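As a hedged sketch of the new contract, a hypothetical ->init_request() implementation might look like the following; the parameter list, the private struct and the "myfs" names are assumptions for illustration, not the actual ceph or netfs code.

	#include <linux/netfs.h>
	#include <linux/slab.h>

	/* Hypothetical per-request private data for the example. */
	struct myfs_read_private {
		void *caps;	/* stand-in for whatever the fs needs to track */
	};

	/*
	 * Hypothetical ->init_request() under the new contract: returning a
	 * negative error aborts the request.  For origins such as readahead,
	 * which cannot propagate errors, netfslib just drops the operation.
	 */
	static int myfs_init_request(struct netfs_io_request *rreq, struct file *file)
	{
		struct myfs_read_private *priv;

		priv = kzalloc(sizeof(*priv), GFP_KERNEL);
		if (!priv)
			return -ENOMEM;		/* the failure path this patch enables */

		rreq->netfs_priv = priv;	/* stashed for later freeing */
		return 0;
	}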
2022-03-18  netfs: Refactor arguments for netfs_alloc_read_request  (David Howells)

Pass start and len to the rreq allocator. This should ensure that the fields are set so that ->init_request() can use them.

Also add a parameter to indicate the origin of the request. Ceph can use this to tell whether to get caps.

Changes
=======
ver #3)
 - Change the author to me as Jeff feels that most of the patch is my changes now.

ver #2)
 - Show the request origin in the netfs_rreq tracepoint.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Co-developed-by: David Howells <dhowells@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/164622989020.3564931.17517006047854958747.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/164678208569.1200972.12153682697842916557.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/164692904155.2099075.14717645623034355995.stgit@warthog.procyon.org.uk/ # v3
2022-03-18  netfs: Trace refcounting on the netfs_io_subrequest struct  (David Howells)

Add refcount tracing for the netfs_io_subrequest structure.

Changes
=======
ver #3)
 - Switch 'W=' to 'R=' in the traceline to match other request debug IDs.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/164622998584.3564931.5052255990645723639.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/164678202603.1200972.14726007419792315578.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/164692901860.2099075.4845820886851239935.stgit@warthog.procyon.org.uk/ # v3
2022-03-18  netfs: Trace refcounting on the netfs_io_request struct  (David Howells)

Add refcount tracing for the netfs_io_request structure.

Changes
=======
ver #3)
 - Switch 'W=' to 'R=' in the traceline to match other request debug IDs.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/164622997668.3564931.14456171619219324968.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/164678200943.1200972.7241495532327787765.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/164692900920.2099075.11847712419940675791.stgit@warthog.procyon.org.uk/ # v3
2022-03-18  netfs: Adjust the netfs_rreq tracepoint slightly  (David Howells)
Adjust the netfs_rreq tracepoint to include the origin of the request and to increase the size of the "what trace" output strings by a character so that "ENCRYPT" and "DECRYPT" will fit without abbreviation. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> cc: linux-cachefs@redhat.com Link: https://lore.kernel.org/r/164622996715.3564931.4252319907990358129.stgit@warthog.procyon.org.uk/ # v1 Link: https://lore.kernel.org/r/164678199468.1200972.17275585970238114726.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/164692898684.2099075.12153225958137716567.stgit@warthog.procyon.org.uk/ # v3
2022-03-18  netfs: Split netfs_io_* object handling out  (David Howells)
Split netfs_io_* object handling out into a file that's going to contain object allocation, get and put routines. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> cc: linux-cachefs@redhat.com Link: https://lore.kernel.org/r/164622995118.3564931.6089530629052064470.stgit@warthog.procyon.org.uk/ # v1 Link: https://lore.kernel.org/r/164678197044.1200972.11511937252083343775.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/164692894693.2099075.7831091294248735173.stgit@warthog.procyon.org.uk/ # v3
2022-03-18  netfs: Finish off rename of netfs_read_request to netfs_io_request  (David Howells)

Adjust helper function names and comments after mass rename of struct netfs_read_*request to struct netfs_io_*request.

Changes
=======
ver #2)
 - Make the changes in the docs also.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/164622992433.3564931.6684311087845150271.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/164678196111.1200972.5001114956865989528.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/164692892567.2099075.13895804222087028813.stgit@warthog.procyon.org.uk/ # v3
2022-03-18  netfs: Rename netfs_read_*request to netfs_io_*request  (David Howells)

Rename netfs_read_*request to netfs_io_*request so that the same structures can be used for the write helpers too.

	perl -p -i -e 's/netfs_read_(request|subrequest)/netfs_io_$1/g' \
	    `git grep -l 'netfs_read_\(sub\|\)request'`
	perl -p -i -e 's/nr_rd_ops/nr_outstanding/g' \
	    `git grep -l nr_rd_ops`
	perl -p -i -e 's/nr_wr_ops/nr_copy_ops/g' \
	    `git grep -l nr_wr_ops`
	perl -p -i -e 's/netfs_read_source/netfs_io_source/g' \
	    `git grep -l 'netfs_read_source'`
	perl -p -i -e 's/netfs_io_request_ops/netfs_request_ops/g' \
	    `git grep -l 'netfs_io_request_ops'`
	perl -p -i -e 's/init_rreq/init_request/g' \
	    `git grep -l 'init_rreq'`

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/164622988070.3564931.7089670190434315183.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/164678195157.1200972.366609966927368090.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/164692891535.2099075.18435198075367420588.stgit@warthog.procyon.org.uk/ # v3
2022-03-18  fscache: export fscache_end_operation()  (Jeffle Xu)
Export fscache_end_operation() to avoid code duplication. Besides, considering the paired fscache_begin_read_operation() is already exported, it shall make sense to also export fscache_end_operation(). Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> cc: linux-cachefs@redhat.com Link: https://lore.kernel.org/r/20220302125134.131039-2-jefflexu@linux.alibaba.com/ # Jeffle's v4 Link: https://lore.kernel.org/r/164622971432.3564931.12184135678781328146.stgit@warthog.procyon.org.uk/ # v1 Link: https://lore.kernel.org/r/164678190346.1200972.7453733431978569479.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/164692888334.2099075.5166283293894267365.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/20220316131723.111553-2-jefflexu@linux.alibaba.com/ # v5
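As a hedged sketch of the begin/end pairing the commit refers to, a caller might look roughly like the following; the exact prototypes and the "myfs" name are assumptions for illustration only.

	#include <linux/fscache.h>
	#include <linux/netfs.h>

	/* Rough sketch: pair the exported end helper with the begin helper. */
	static int myfs_read_through_cache(struct netfs_cache_resources *cres,
					   struct fscache_cookie *cookie)
	{
		int ret;

		ret = fscache_begin_read_operation(cres, cookie);
		if (ret < 0)
			return ret;		/* e.g. no usable cache data */

		/* ... issue the actual cache read using @cres ... */

		fscache_end_operation(cres);	/* no longer open-coded per filesystem */
		return 0;
	}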
2022-03-01  ceph: misc fix for code style and logs  (Xiubo Li)

Make the logs more readable, such as for logs like:

	ceph: will move 00000000a42b796b to split realm 100000003ed 000000007146df45

With this it will always show the inode numbers instead of the inode addresses.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-03-01  ceph: allocate capsnap memory outside of ceph_queue_cap_snap()  (Xiubo Li)

This avoids very likely but unnecessary and frequent memory allocation and freeing in this loop.

URL: https://tracker.ceph.com/issues/44100
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-03-01  ceph: do not release the global snaprealm until unmounting  (Xiubo Li)

The global snaprealm would be created and then immediately destroyed every time it was updated.

URL: https://tracker.ceph.com/issues/54362
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-03-01  ceph: eliminate the recursion when rebuilding the snap context  (Xiubo Li)
Use a list instead of recursion to avoid possible stack overflow. Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-03-01  ceph: do not update snapshot context when there is no new snapshot  (Xiubo Li)

We will only track the uppermost parent snapshot realm from which we need to rebuild the snapshot contexts _downward_ in the hierarchy. For all the others with no new snapshot we will do nothing.

This fix avoids calling ceph_queue_cap_snap() on some inodes inappropriately. For example, with the code in mainline, suppose there are 2 directory hierarchies (with 6 directories total), like this:

	/dir_X1/dir_X2/dir_X3/
	/dir_Y1/dir_Y2/dir_Y3/

Firstly, make a snapshot under /dir_X1/dir_X2/.snap/snap_X2, then make a root snapshot under /.snap/root_snap. Every time we make snapshots under /dir_Y1/..., the kclient will always try to rebuild the snap context for the snap_X2 realm and finally will always try to queue cap snaps for dir_Y2 and dir_Y3, which makes no sense.

That's because the snap_X2's seq is 2 and root_snap's seq is 3. So when creating a new snapshot under /dir_Y1/... the new seq will be 4, and the mds will send the kclient a snapshot backtrace in _downward_ order: seqs 4, 3.

When ceph_update_snap_trace() is called, it will always rebuild from the last realm, that's the root_snap. So later when rebuilding the snap context, the current logic will always cause it to rebuild the snap_X2 realm and then try to queue cap snaps for all the inodes related in that realm, even though it's not necessary.

This is accompanied by a lot of these sorts of dout messages:

	"ceph: queue_cap_snap 00000000a42b796b nothing dirty|writing"

Fix the logic to avoid this situation.

Also, the 'invalidate' word is not precise here. In actuality, it will cause a rebuild of the existing snapshot contexts or just build non-existent ones. Rename it to 'rebuild_snapcs'.

URL: https://tracker.ceph.com/issues/44100
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-03-01  ceph: zero the dir_entries memory when allocating it  (Xiubo Li)

Without this, a bug could potentially arise in the future if an old ceph version sends a smaller inode struct, which can cause some members to be skipped in handle_reply and left uninitialised.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-03-01  ceph: move to a dedicated slabcache for ceph_cap_snap  (Xiubo Li)

There could be a huge number of capsnaps around at any given time. On x86_64 the structure is 248 bytes, which will be rounded up to 256 bytes by kzalloc. Move this to a dedicated slabcache to save 8 bytes for each.

[ jlayton: use kmem_cache_zalloc ]

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
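A minimal sketch of the dedicated-slabcache pattern this commit describes is shown below; the "my_" names are invented, struct ceph_cap_snap is ceph's own type (declared in ceph's headers), and the real ceph code differs in detail.

	#include <linux/slab.h>

	/* A dedicated cache packs objects at their exact size (248 bytes here)
	 * instead of rounding up to the next kmalloc bucket (256 bytes). */
	static struct kmem_cache *my_cap_snap_cachep;

	static int __init my_cap_snap_cache_init(void)
	{
		my_cap_snap_cachep = KMEM_CACHE(ceph_cap_snap, 0);
		return my_cap_snap_cachep ? 0 : -ENOMEM;
	}

	static struct ceph_cap_snap *my_alloc_cap_snap(void)
	{
		/* kmem_cache_zalloc: allocate from the cache and zero it. */
		return kmem_cache_zalloc(my_cap_snap_cachep, GFP_NOFS);
	}

	static void my_free_cap_snap(struct ceph_cap_snap *capsnap)
	{
		kmem_cache_free(my_cap_snap_cachep, capsnap);
	}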
2022-03-01  ceph: add getvxattr op  (Milind Changire)

Problem:
Some directory vxattrs (e.g. ceph.dir.pin.random) are governed by information that isn't necessarily shared with the client. Add support for the new GETVXATTR operation, which allows the client to query the MDS directly for vxattrs. When the client is queried for a vxattr that doesn't have a special handler, have it issue a GETVXATTR to the MDS directly.

Solution:
Adds new getvxattr op to fetch ceph.dir.pin*, ceph.dir.layout* and ceph.file.layout* vxattrs. If the entire layout for a dir or a file is being set, then it is expected that the layout be set in standard JSON format. Individual field value retrieval is not wrapped in JSON. The JSON format also applies while setting the vxattr if the entire layout is being set in one go. As a temporary measure, setting a vxattr can also be done in the old format. The old format will be deprecated in the future.

URL: https://tracker.ceph.com/issues/51062
Signed-off-by: Milind Changire <mchangir@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-03-01  ceph: fix comments mentioning i_mutex  (hongnanli)
inode->i_mutex has been replaced with inode->i_rwsem long ago. Fix comments still mentioning i_mutex. Signed-off-by: hongnanli <hongnan.li@linux.alibaba.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-03-01  ceph: fail the request directly if handle_reply gets an ESTALE  (Xiubo Li)

If the MDS returns ESTALE, that means the MDS has already iterated over all the possible active MDSes, including the auth MDS, or the inode is being purged. There is no need to retry on the auth MDS; just return ESTALE directly. Retrying in this situation will cause an infinite loop.

Also, retrying like this would prevent the kernel VFS layer ESTALE handling from working properly. An ESTALE error is usually an indication that the dcache is wrong, so we want to allow the VFS to redo the lookup and revalidate it properly.

URL: https://tracker.ceph.com/issues/53504
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Acked-by: Greg Farnum <gfarnum@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-03-01  ceph: wake waiters after failed async create  (Jeff Layton)
Currently, we only wake the waiters if we got a req->r_target_inode out of the reply. In the case where the create fails though, we may not have one. If there is a non-zero result from the create, then ensure that we wake anything waiting on the inode after we shut it down. URL: https://tracker.ceph.com/issues/54067 Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-03-01  ceph: wait for async create reply before sending any cap messages  (Jeff Layton)
If we haven't received a reply to an async create request, then we don't want to send any cap messages to the MDS for that inode yet. Just have ceph_check_caps and __kick_flushing_caps return without doing anything, and have ceph_write_inode wait for the reply if we were asked to wait on the inode writeback. URL: https://tracker.ceph.com/issues/54107 Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Patrick Donnelly <pdonnell@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-03-01  ceph: eliminate req->r_wait_for_completion from ceph_mds_request  (Jeff Layton)

...and instead just pass the wait function on the stack. Make ceph_mdsc_wait_request non-static, and add an argument for the wait-for-completion function. Then have ceph_lock_message call ceph_mdsc_submit_request and ceph_mdsc_wait_request, passing in the pointer to ceph_lock_wait_for_completion.

While we're in there, rearrange some fields in ceph_mds_request, so we save a total of 24 bytes per request.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-03-01  ceph: uninline the data on a file opened for writing  (David Howells)

If a ceph file is made up of inline data, uninline that in ceph_open() rather than in ceph_page_mkwrite(), ceph_write_iter(), ceph_fallocate() or ceph_write_begin().

This makes it easier to convert to using the netfs library for VM write hooks.

Should this also take the inode lock for the duration of uninlining to prevent a race with truncation?

[ jlayton: fix up folio locking, update i_inline_version after write ]

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-03-01  ceph: make ceph_netfs_issue_op() handle inlined data  (David Howells)
Make ceph_netfs_issue_op() handle inlined data on page 0. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-03-01  ceph: switch netfs read ops to use rreq->inode instead of rreq->mapping->host  (Jeff Layton)
One fewer pointer dereference, and in the future we may not be able to count on the mapping pointer being populated (e.g. in the DIO case). Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-02-26  Merge tag 'trace-v5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace  (Linus Torvalds)

Pull tracing fixes from Steven Rostedt:

 - rtla (Real-Time Linux Analysis tool):
     - fix typo in man page
     - Update API -e to -E before it is released
     - Error message fix and memory leak fix

 - Partially uninline trace event soft disable to shrink text

 - Fix function graph start up test

 - Have triggers affect the trace instance they are in and not top level

 - Have osnoise sleep in the units it says it uses

 - Remove unused ftrace stub function

 - Remove event probe redundant info from event in the buffer

 - Fix group ownership setting in tracefs

 - Ensure trace buffer is minimum size to prevent crashes

* tag 'trace-v5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  rtla/osnoise: Fix error message when failing to enable trace instance
  rtla/osnoise: Free params at the exit
  rtla/hist: Make -E the short version of --entries
  tracing: Fix selftest config check for function graph start up test
  tracefs: Set the group ownership in apply_options() not parse_options()
  tracing/osnoise: Make osnoise_main to sleep for microseconds
  ftrace: Remove unused ftrace_startup_enable() stub
  tracing: Ensure trace buffer is at least 4096 bytes large
  tracing: Uninline trace_trigger_soft_disabled() partly
  eprobes: Remove redundant event type information
  tracing: Have traceon and traceoff trigger honor the instance
  tracing: Dump stacktrace trigger to the corresponding instance
  rtla: Fix systme -> system typo on man page
2022-02-26  Merge tag 'xfs-5.17-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux  (Linus Torvalds)

Pull xfs fixes from Darrick Wong:
 "Nothing exciting, just more fixes for not returning sync_filesystem error values (and eliding it when it's not necessary).

  Summary:

   - Only call sync_filesystem when we're remounting the filesystem readonly, and actually check its return value"

* tag 'xfs-5.17-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: only bother with sync_filesystem during readonly remount
2022-02-25  tracefs: Set the group ownership in apply_options() not parse_options()  (Steven Rostedt (Google))
Al Viro brought it to my attention that the dentries may not be filled when the parse_options() is called, causing the call to set_gid() to possibly crash. It should only be called if parse_options() succeeds totally anyway. He suggested the logical place to do the update is in apply_options(). Link: https://lore.kernel.org/all/20220225165219.737025658@goodmis.org/ Link: https://lkml.kernel.org/r/20220225153426.1c4cab6b@gandalf.local.home Cc: stable@vger.kernel.org Acked-by: Al Viro <viro@zeniv.linux.org.uk> Reported-by: Al Viro <viro@zeniv.linux.org.uk> Fixes: 48b27b6b5191 ("tracefs: Set all files to the same group ownership as the mount option") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-02-25  Merge tag 'configfs-5.17-2022-02-25' of git://git.infradead.org/users/hch/configfs  (Linus Torvalds)

Pull configfs fix from Christoph Hellwig:

 - fix a race in configfs_{,un}register_subsystem (ChenXiaoSong)

* tag 'configfs-5.17-2022-02-25' of git://git.infradead.org/users/hch/configfs:
  configfs: fix a race in configfs_{,un}register_subsystem()
2022-02-25  Merge tag 'for-5.17-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux  (Linus Torvalds)

Pull btrfs fixes from David Sterba:
 "This is a hopefully last batch of fixes for defrag that got broken in 5.16, all stable material.

  The remaining reported problem is excessive IO with autodefrag due to various conditions in the defrag code not met or missing"

* tag 'for-5.17-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: reduce extent threshold for autodefrag
  btrfs: autodefrag: only scan one inode once
  btrfs: defrag: don't use merged extent map for their generation check
  btrfs: defrag: bring back the old file extent search behavior
  btrfs: defrag: remove an ambiguous condition for rejection
  btrfs: defrag: don't defrag extents which are already at max capacity
  btrfs: defrag: don't try to merge regular extents with preallocated extents
  btrfs: defrag: allow defrag_one_cluster() to skip large extent which is not a target
  btrfs: prevent copying too big compressed lzo segment
2022-02-24  Merge tag 'io_uring-5.17-2022-02-23' of git://git.kernel.dk/linux-block  (Linus Torvalds)

Pull io_uring fixes from Jens Axboe:

 - Add a conditional schedule point in io_add_buffers() (Eric)

 - Fix for a quiesce speedup merged in this release (Dylan)

 - Don't convert to jiffies for event timeout waiting, it's way too coarse when we accept a timespec as input (me)

* tag 'io_uring-5.17-2022-02-23' of git://git.kernel.dk/linux-block:
  io_uring: disallow modification of rsrc_data during quiesce
  io_uring: don't convert to jiffies for waiting on timeouts
  io_uring: add a schedule point in io_add_buffers()
2022-02-24  btrfs: reduce extent threshold for autodefrag  (Qu Wenruo)

There is a big gap between inode_should_defrag() and the autodefrag extent size threshold.

For inode_should_defrag() there is a flexible @small_write value: for a compressed extent it's 16K, and for a non-compressed extent it's 64K.

However the autodefrag extent size threshold is always fixed to the default value (256K).

This means the following write sequence will trigger autodefrag to defrag ranges which didn't trigger autodefrag:

	pwrite 0 8k
	sync
	pwrite 8k 128K
	sync

The latter 128K write will also be considered as a defrag target (if other conditions are met), while only that 8K write is really triggering autodefrag.

Such behavior can cause extra IO for autodefrag.

Close the gap by copying the @small_write value into inode_defrag, so that later autodefrag can use the same @small_write value which triggered autodefrag.

With the existing transid value, this allows autodefrag really to scan the ranges which triggered autodefrag.

Although this behavior change is mostly reducing the extent_thresh value for autodefrag, I believe in the future we should allow users to specify the autodefrag extent threshold through mount options, but that's another problem to consider in the future.

CC: stable@vger.kernel.org # 5.16+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-02-23  btrfs: autodefrag: only scan one inode once  (Qu Wenruo)

Although we have btrfs_requeue_inode_defrag(), for autodefrag we are still just exhausting all inode_defrag items in the tree. This means it doesn't make much difference to requeue an inode_defrag, other than scanning the inode from the beginning till its end.

Change the behaviour to always scan from offset 0 of an inode, and till the end.

By this we get the following benefits:

 - Straight-forward code
 - No more re-queue related check
 - Fewer members in inode_defrag

We still keep the same btrfs_get_fs_root() and btrfs_iget() check for each loop, and added an extra should_auto_defrag() check per-loop.

Note: the patch needs to be backported and is intentionally written to minimize the diff size, code will be cleaned up later.

CC: stable@vger.kernel.org # 5.16
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-02-23  btrfs: defrag: don't use merged extent map for their generation check  (Qu Wenruo)

For extent maps, if they are not compressed extents and are adjacent by logical addresses and file offsets, they can be merged into one larger extent map. Such a merged extent map will have the higher generation of all the original ones.

But this brings a problem for autodefrag, as it relies on accurate extent_map::generation to determine if one extent should be defragged. For merged extent maps, their higher generation can mark some older extents to be defragged while the original extent map doesn't meet the minimal generation threshold. Thus this will cause extra IO.

To solve the problem, here we introduce a new flag, EXTENT_FLAG_MERGED, to indicate if the extent map is merged from one or more ems. And for autodefrag, if we find a merged extent map, and its generation meets the generation requirement, we just don't use this one, and go back to defrag_get_extent() to read extent maps from subvolume trees.

This could cause more read IO, but should result in less defrag data being written, so in the long run it should be a win for autodefrag.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-02-23  btrfs: defrag: bring back the old file extent search behavior  (Qu Wenruo)

For defrag, we don't really want to use btrfs_get_extent() to iterate all extent maps of an inode. The reasons are:

 - btrfs_get_extent() can merge extent maps
   The resulting em has the higher generation of the two, causing defrag to mark unnecessary parts of such a merged large extent map.

   This in fact can result in extra IO for autodefrag in v5.16+ kernels.

   However this patch is not going to completely solve the problem, as one can still use read() to trigger extent map reading, and get them merged.

   The complete solution for the extent map merging generation problem will come as a standalone fix.

 - btrfs_get_extent() caches the extent map result
   Normally it's fine, but for defrag the target range may not get another read/write for a long long time. Such a cache would only increase the memory usage.

 - btrfs_get_extent() doesn't skip older extent maps
   Unlike the old find_new_extent(), which uses btrfs_search_forward() to skip the older subtree, it will pick up unnecessary extent maps.

This patch will fix the regression by introducing defrag_get_extent() to replace the btrfs_get_extent() call. This helper will:

 - Not cache the file extent we found
   It will search the file extent and manually convert it to an em.

 - Use btrfs_search_forward() to skip entire ranges which were modified in the past

This should reduce the IO for autodefrag.

Reported-by: Filipe Manana <fdmanana@suse.com>
Fixes: 7b508037d4ca ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-02-23  btrfs: defrag: remove an ambiguous condition for rejection  (Qu Wenruo)

From the very beginning of btrfs defrag, there is a check to reject extents which meet both conditions:

 - Physically adjacent
   We may want to defrag physically adjacent extents to reduce the number of extents or the size of the subvolume tree.

 - Larger than 128K
   This may be there for compressed extents, but unfortunately 128K is exactly the max capacity for compressed extents. And the check is > 128K, thus it never rejects compressed extents.

   Furthermore, the compressed extent capacity bug is fixed by the previous patch, so there is no reason for that check anymore.

The original check has only a very small range to reject (the target extent size is > 128K, and the default extent threshold is 256K), and for compressed extents it doesn't work at all.

So it's better just to remove the rejection, and allow us to defrag physically adjacent extents.

CC: stable@vger.kernel.org # 5.16
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-02-23  btrfs: defrag: don't defrag extents which are already at max capacity  (Qu Wenruo)

[BUG]
For compressed extents, defrag ioctl will always try to defrag any compressed extents, wasting not only IO but also CPU time to compress/decompress:

	mkfs.btrfs -f $DEV
	mount -o compress $DEV $MNT
	xfs_io -f -c "pwrite -S 0xab 0 128K" $MNT/foobar
	sync
	xfs_io -f -c "pwrite -S 0xcd 128K 128K" $MNT/foobar
	sync
	echo "=== before ==="
	xfs_io -c "fiemap -v" $MNT/foobar
	btrfs filesystem defrag $MNT/foobar
	sync
	echo "=== after ==="
	xfs_io -c "fiemap -v" $MNT/foobar

Then it shows the 2 128K extents just get COW for no extra benefit, with extra IO/CPU spent:

	=== before ===
	/mnt/btrfs/file1:
	 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
	   0: [0..255]:        26624..26879       256   0x8
	   1: [256..511]:      26632..26887       256   0x9
	=== after ===
	/mnt/btrfs/file1:
	 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
	   0: [0..255]:        26640..26895       256   0x8
	   1: [256..511]:      26648..26903       256   0x9

This affects not only v5.16 (after the defrag rework), but also v5.15 (before the defrag rework).

[CAUSE]
From the very beginning, btrfs defrag never checks if one extent is already at its max capacity (128K for compressed extents, 128M otherwise).

And the default extent size threshold is 256K, which is already beyond the compressed extent max size.

This means, by default the btrfs defrag ioctl will mark all compressed extents which are not adjacent to a hole/preallocated range for defrag.

[FIX]
Introduce a helper to grab the maximum extent size, and then in defrag_collect_targets() and defrag_check_next_extent(), reject extents which are already at their max capacity.

Reported-by: Filipe Manana <fdmanana@suse.com>
CC: stable@vger.kernel.org # 5.16
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-02-23  btrfs: defrag: don't try to merge regular extents with preallocated extents  (Qu Wenruo)

[BUG]
With older kernels (before v5.16), btrfs will defrag preallocated extents. While with newer kernels (v5.16 and newer) btrfs will not defrag preallocated extents, it will still defrag the extent just before the preallocated extent, even if it's just a single sector.

This can be exposed by the following small script:

	mkfs.btrfs -f $dev > /dev/null
	mount $dev $mnt
	xfs_io -f -c "pwrite 0 4k" -c sync -c "falloc 4k 16K" $mnt/file
	xfs_io -c "fiemap -v" $mnt/file
	btrfs fi defrag $mnt/file
	sync
	xfs_io -c "fiemap -v" $mnt/file

The output looks like this on older kernels:

	/mnt/btrfs/file:
	 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
	   0: [0..7]:          26624..26631         8   0x0
	   1: [8..39]:         26632..26663        32 0x801
	/mnt/btrfs/file:
	 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
	   0: [0..39]:         26664..26703        40   0x1

Which defrags the single sector along with the preallocated extent, and replaces them with a regular extent in a new location (caused by data COW). This wastes most of the data IO just for the preallocated range.

On the other hand, v5.16 is slightly better:

	/mnt/btrfs/file:
	 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
	   0: [0..7]:          26624..26631         8   0x0
	   1: [8..39]:         26632..26663        32 0x801
	/mnt/btrfs/file:
	 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
	   0: [0..7]:          26664..26671         8   0x0
	   1: [8..39]:         26632..26663        32 0x801

The preallocated range is not defragged, but the sector before it still gets defragged, for which there is no need.

[CAUSE]
One of the functions reused by the old and new behavior is defrag_check_next_extent(). It determines whether we should defrag the current extent by checking the next one. It only checks if the next extent is a hole or inlined, but it doesn't check if it's preallocated.

On the other hand, outside of that function, both old and new kernels reject preallocated extents.

Such inconsistent behavior causes the above results.

[FIX]
 - Also check if the next extent is preallocated
   If so, don't defrag the current extent.

 - Add comments for each branch explaining why we reject the extent

This will reduce the IO caused by the defrag ioctl and autodefrag.

CC: stable@vger.kernel.org # 5.16
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-02-22  configfs: fix a race in configfs_{,un}register_subsystem()  (ChenXiaoSong)

When configfs_register_subsystem() or configfs_unregister_subsystem() is executing link_group() or unlink_group(), it is possible for two processes to add to or delete from the list concurrently. Some unfortunate interleavings of them can cause a kernel panic.

One of the cases is:

	A --> B --> C --> D
	A <-- B <-- C <-- D

	     delete list_head *B      |      delete list_head *C
	------------------------------|-----------------------------------
	configfs_unregister_subsystem |  configfs_unregister_subsystem
	  unlink_group                |    unlink_group
	    unlink_obj                |      unlink_obj
	      list_del_init           |        list_del_init
	        __list_del_entry      |          __list_del_entry
	          __list_del          |            __list_del
	            // next == C      |
	            next->prev = prev |
	                              |            next->prev = prev
	            prev->next = next |
	                              |            // prev == B
	                              |            prev->next = next

Fix this by adding a mutex when calling link_group() or unlink_group(); but the parent configfs_subsystem is NULL when the config_item is root, so I create a mutex configfs_subsystem_mutex.

Fixes: 7063fbf22611 ("[PATCH] configfs: User-driven configuration filesystem")
Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com>
Signed-off-by: Laibin Qiu <qiulaibin@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
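A schematic sketch of the serialisation pattern the fix describes is shown below; it is not the upstream diff, the wrapper names are invented, and link_group()/unlink_group() are the static helpers in fs/configfs/dir.c named by the commit message.

	#include <linux/mutex.h>
	#include <linux/configfs.h>

	/* One global mutex serialises the otherwise-unlocked group list changes. */
	static DEFINE_MUTEX(configfs_subsystem_mutex);

	static void register_subsys_link(struct config_group *parent_group,
					 struct config_group *group)
	{
		mutex_lock(&configfs_subsystem_mutex);
		link_group(parent_group, group);	/* list add, now serialised */
		mutex_unlock(&configfs_subsystem_mutex);
	}

	static void unregister_subsys_unlink(struct config_group *group)
	{
		mutex_lock(&configfs_subsystem_mutex);
		unlink_group(group);			/* list del, now serialised */
		mutex_unlock(&configfs_subsystem_mutex);
	}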
2022-02-22  io_uring: disallow modification of rsrc_data during quiesce  (Dylan Yudaken)

io_rsrc_ref_quiesce will unlock the uring while it waits for references to the io_rsrc_data to be killed. There are other places that might add references to the data via calls to io_rsrc_node_switch. There is a race condition where this reference can be added after the completion has been signalled. At this point the io_rsrc_ref_quiesce call will wake up and relock the uring, assuming the data is unused and can be freed - although it is actually being used.

To fix this, check in io_rsrc_ref_quiesce whether a resource has been revived.

Reported-by: syzbot+ca8bf833622a1662745b@syzkaller.appspotmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220222161751.995746-1-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-21  io_uring: don't convert to jiffies for waiting on timeouts  (Jens Axboe)

If an application calls io_uring_enter(2) with a timespec passed in, convert that timespec to ktime_t rather than jiffies. The latter does not provide the granularity the application may expect, and may in fact provide different granularity on different systems, depending on what the HZ value is configured at.

Turn the timespec into an absolute ktime_t, and use that with schedule_hrtimeout() instead.

Link: https://github.com/axboe/liburing/issues/531
Cc: stable@vger.kernel.org
Reported-by: Bob Chen <chenbo.chen@alibaba-inc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
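A generic sketch of the relative-timespec-to-absolute-ktime pattern described above follows; the helper name is made up and this is not the io_uring implementation itself.

	#include <linux/ktime.h>
	#include <linux/hrtimer.h>
	#include <linux/sched.h>

	/*
	 * Sleep for a user-supplied relative timeout with ktime granularity
	 * instead of converting it to jiffies.  Hypothetical helper only.
	 */
	static int my_wait_for_timeout(struct timespec64 *ts)
	{
		/* Make the deadline absolute so a re-wait after wakeup doesn't drift. */
		ktime_t timeout = ktime_add(ktime_get(), timespec64_to_ktime(*ts));

		set_current_state(TASK_INTERRUPTIBLE);

		/* Returns 0 when the timer expires, -EINTR if woken early. */
		return schedule_hrtimeout(&timeout, HRTIMER_MODE_ABS);
	}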
2022-02-20  Merge tag 'fs.mount_setattr.v5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux  (Linus Torvalds)

Pull mount_setattr test/doc fixes from Christian Brauner:
 "This contains a fix for one of the selftests for the mount_setattr syscall to create idmapped mounts, an entry for idmapped mounts for maintainers, and missing kernel documentation for the helper we split out some time ago to get and yield write access to a mount when changing mount properties"

* tag 'fs.mount_setattr.v5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  fs: add kernel doc for mnt_{hold,unhold}_writers()
  MAINTAINERS: add entry for idmapped mounts
  tests: fix idmapped mount_setattr test
2022-02-18  Merge tag 'nfs-for-5.17-3' of git://git.linux-nfs.org/projects/anna/linux-nfs  (Linus Torvalds)

Pull NFS client bugfixes from Anna Schumaker:

 - Fix unnecessary changeattr revalidations

 - Fix resolving symlinks during directory lookups

 - Don't report writeback errors in nfs_getattr()

* tag 'nfs-for-5.17-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFS: Do not report writeback errors in nfs_getattr()
  NFS: LOOKUP_DIRECTORY is also ok with symlinks
  NFS: Remove an incorrect revalidation in nfs4_update_changeattr_locked()