From 4c80a97d7b02cf68e169118ef2bda0725fc87f6f Mon Sep 17 00:00:00 2001
From: Qu Wenruo <wqu@suse.com>
Date: Tue, 25 May 2021 13:52:43 +0800
Subject: btrfs: fix compressed writes that cross stripe boundary

[BUG]
When running btrfs/027 with "-o compress" mount option, it always
crashes with the following call trace:

  BTRFS critical (device dm-4): mapping failed logical 298901504 bio len 12288 len 8192
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/volumes.c:6651!
  invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
  CPU: 5 PID: 31089 Comm: kworker/u24:10 Tainted: G           OE     5.13.0-rc2-custom+ #26
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  Workqueue: btrfs-delalloc btrfs_work_helper [btrfs]
  RIP: 0010:btrfs_map_bio.cold+0x58/0x5a [btrfs]
  Call Trace:
   btrfs_submit_compressed_write+0x2d7/0x470 [btrfs]
   submit_compressed_extents+0x3b0/0x470 [btrfs]
   ? mark_held_locks+0x49/0x70
   btrfs_work_helper+0x131/0x3e0 [btrfs]
   process_one_work+0x28f/0x5d0
   worker_thread+0x55/0x3c0
   ? process_one_work+0x5d0/0x5d0
   kthread+0x141/0x160
   ? __kthread_bind_mask+0x60/0x60
   ret_from_fork+0x22/0x30
  ---[ end trace 63113a3a91f34e68 ]---

[CAUSE]
The critical message before the crash means we have a bio at logical
bytenr 298901504 length 12288, but only 8192 bytes can fit into one
stripe, the remaining 4096 bytes go to another stripe.

In btrfs, all bios are properly split to avoid cross stripe boundary,
but commit 764c7c9a464b ("btrfs: zoned: fix parallel compressed writes")
changed the behavior for compressed writes.

Previously if we find our new page can't be fitted into current stripe,
ie. "submit == 1" case, we submit current bio without adding current
page.

       submit = btrfs_bio_fits_in_stripe(page, PAGE_SIZE, bio, 0);

   page->mapping = NULL;
   if (submit || bio_add_page(bio, page, PAGE_SIZE, 0) <
       PAGE_SIZE) {

But after the modification, we will add the page no matter if it crosses
stripe boundary, leading to the above crash.

       submit = btrfs_bio_fits_in_stripe(page, PAGE_SIZE, bio, 0);

   if (pg_index == 0 && use_append)
           len = bio_add_zone_append_page(bio, page, PAGE_SIZE, 0);
   else
           len = bio_add_page(bio, page, PAGE_SIZE, 0);

   page->mapping = NULL;
   if (submit || len < PAGE_SIZE) {

[FIX]
It's no longer possible to revert to the original code style as we have
two different bio_add_*_page() calls now.

The new fix is to skip the bio_add_*_page() call if @submit is true.

Also to avoid @len to be uninitialized, always initialize it to zero.

If @submit is true, @len will not be checked.
If @submit is not true, @len will be the return value of
bio_add_*_page() call.
Either way, the behavior is still the same as the old code.

Reported-by: Josef Bacik <josef@toxicpanda.com>
Fixes: 764c7c9a464b ("btrfs: zoned: fix parallel compressed writes")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/compression.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

(limited to 'fs')

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 91743a0b34c5..5cb4f3b88285 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -457,7 +457,7 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
 	bytes_left = compressed_len;
 	for (pg_index = 0; pg_index < cb->nr_pages; pg_index++) {
 		int submit = 0;
-		int len;
+		int len = 0;
 
 		page = compressed_pages[pg_index];
 		page->mapping = inode->vfs_inode.i_mapping;
@@ -465,10 +465,17 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
 			submit = btrfs_bio_fits_in_stripe(page, PAGE_SIZE, bio,
 							  0);
 
-		if (pg_index == 0 && use_append)
-			len = bio_add_zone_append_page(bio, page, PAGE_SIZE, 0);
-		else
-			len = bio_add_page(bio, page, PAGE_SIZE, 0);
+		/*
+		 * Page can only be added to bio if the current bio fits in
+		 * stripe.
+		 */
+		if (!submit) {
+			if (pg_index == 0 && use_append)
+				len = bio_add_zone_append_page(bio, page,
+							       PAGE_SIZE, 0);
+			else
+				len = bio_add_page(bio, page, PAGE_SIZE, 0);
+		}
 
 		page->mapping = NULL;
 		if (submit || len < PAGE_SIZE) {
-- 
cgit 1.4.1


From b86652be7c83f70bf406bed18ecf55adb9bfb91b Mon Sep 17 00:00:00 2001
From: Josef Bacik <josef@toxicpanda.com>
Date: Wed, 19 May 2021 10:52:45 -0400
Subject: btrfs: fix error handling in btrfs_del_csums

Error injection stress would sometimes fail with checksums on disk that
did not have a corresponding extent.  This occurred because the pattern
in btrfs_del_csums was

	while (1) {
		ret = btrfs_search_slot();
		if (ret < 0)
			break;
	}
	ret = 0;
out:
	btrfs_free_path(path);
	return ret;

If we got an error from btrfs_search_slot we'd clear the error because
we were breaking instead of goto out.  Instead of using goto out, simply
handle the cases where we may leave a random value in ret, and get rid
of the

	ret = 0;
out:

pattern and simply allow break to have the proper error reporting.  With
this fix we properly abort the transaction and do not commit thinking we
successfully deleted the csum.

Reviewed-by: Qu Wenruo <wqu@suse.com>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/file-item.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

(limited to 'fs')

diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 294602f139ef..a5a8dac334e8 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -788,7 +788,7 @@ int btrfs_del_csums(struct btrfs_trans_handle *trans,
 	u64 end_byte = bytenr + len;
 	u64 csum_end;
 	struct extent_buffer *leaf;
-	int ret;
+	int ret = 0;
 	const u32 csum_size = fs_info->csum_size;
 	u32 blocksize_bits = fs_info->sectorsize_bits;
 
@@ -806,6 +806,7 @@ int btrfs_del_csums(struct btrfs_trans_handle *trans,
 
 		ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
 		if (ret > 0) {
+			ret = 0;
 			if (path->slots[0] == 0)
 				break;
 			path->slots[0]--;
@@ -862,7 +863,7 @@ int btrfs_del_csums(struct btrfs_trans_handle *trans,
 			ret = btrfs_del_items(trans, root, path,
 					      path->slots[0], del_nr);
 			if (ret)
-				goto out;
+				break;
 			if (key.offset == bytenr)
 				break;
 		} else if (key.offset < bytenr && csum_end > end_byte) {
@@ -906,8 +907,9 @@ int btrfs_del_csums(struct btrfs_trans_handle *trans,
 			ret = btrfs_split_item(trans, root, path, &key, offset);
 			if (ret && ret != -EAGAIN) {
 				btrfs_abort_transaction(trans, ret);
-				goto out;
+				break;
 			}
+			ret = 0;
 
 			key.offset = end_byte - 1;
 		} else {
@@ -917,8 +919,6 @@ int btrfs_del_csums(struct btrfs_trans_handle *trans,
 		}
 		btrfs_release_path(path);
 	}
-	ret = 0;
-out:
 	btrfs_free_path(path);
 	return ret;
 }
-- 
cgit 1.4.1


From 856bd270dc4db209c779ce1e9555c7641ffbc88e Mon Sep 17 00:00:00 2001
From: Josef Bacik <josef@toxicpanda.com>
Date: Wed, 19 May 2021 10:52:46 -0400
Subject: btrfs: return errors from btrfs_del_csums in cleanup_ref_head

We are unconditionally returning 0 in cleanup_ref_head, despite the fact
that btrfs_del_csums could fail.  We need to return the error so the
transaction gets aborted properly, fix this by returning ret from
btrfs_del_csums in cleanup_ref_head.

Reviewed-by: Qu Wenruo <wqu@suse.com>
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/extent-tree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'fs')

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index f1d15b68994a..3d5c35e4cb76 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -1868,7 +1868,7 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans,
 	trace_run_delayed_ref_head(fs_info, head, 0);
 	btrfs_delayed_ref_unlock(head);
 	btrfs_put_delayed_ref_head(head);
-	return 0;
+	return ret;
 }
 
 static struct btrfs_delayed_ref_head *btrfs_obtain_ref_head(
-- 
cgit 1.4.1


From d61bec08b904cf171835db98168f82bc338e92e4 Mon Sep 17 00:00:00 2001
From: Josef Bacik <josef@toxicpanda.com>
Date: Wed, 19 May 2021 09:38:27 -0400
Subject: btrfs: mark ordered extent and inode with error if we fail to finish

While doing error injection testing I saw that sometimes we'd get an
abort that wouldn't stop the current transaction commit from completing.
This abort was coming from finish ordered IO, but at this point in the
transaction commit we should have gotten an error and stopped.

It turns out the abort came from finish ordered io while trying to write
out the free space cache.  It occurred to me that any failure inside of
finish_ordered_io isn't actually raised to the person doing the writing,
so we could have any number of failures in this path and think the
ordered extent completed successfully and the inode was fine.

Fix this by marking the ordered extent with BTRFS_ORDERED_IOERR, and
marking the mapping of the inode with mapping_set_error, so any callers
that simply call fdatawait will also get the error.

With this we're seeing the IO error on the free space inode when we fail
to do the finish_ordered_io.

CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/inode.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

(limited to 'fs')

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index bb4ab408d670..e7de0c08b981 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3005,6 +3005,18 @@ out:
 	if (ret || truncated) {
 		u64 unwritten_start = start;
 
+		/*
+		 * If we failed to finish this ordered extent for any reason we
+		 * need to make sure BTRFS_ORDERED_IOERR is set on the ordered
+		 * extent, and mark the inode with the error if it wasn't
+		 * already set.  Any error during writeback would have already
+		 * set the mapping error, so we need to set it if we're the ones
+		 * marking this ordered extent as failed.
+		 */
+		if (ret && !test_and_set_bit(BTRFS_ORDERED_IOERR,
+					     &ordered_extent->flags))
+			mapping_set_error(ordered_extent->inode->i_mapping, -EIO);
+
 		if (truncated)
 			unwritten_start += logical_len;
 		clear_extent_uptodate(io_tree, unwritten_start, end, NULL);
-- 
cgit 1.4.1


From 011b28acf940eb61c000059dd9e2cfcbf52ed96b Mon Sep 17 00:00:00 2001
From: Josef Bacik <josef@toxicpanda.com>
Date: Wed, 19 May 2021 13:13:15 -0400
Subject: btrfs: fixup error handling in fixup_inode_link_counts

This function has the following pattern

	while (1) {
		ret = whatever();
		if (ret)
			goto out;
	}
	ret = 0
out:
	return ret;

However several places in this while loop we simply break; when there's
a problem, thus clearing the return value, and in one case we do a
return -EIO, and leak the memory for the path.

Fix this by re-arranging the loop to deal with ret == 1 coming from
btrfs_search_slot, and then simply delete the

	ret = 0;
out:

bit so everybody can break if there is an error, which will allow for
proper error handling to occur.

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/tree-log.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

(limited to 'fs')

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index c17d6b827b42..375c4642f480 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -1787,6 +1787,7 @@ static noinline int fixup_inode_link_counts(struct btrfs_trans_handle *trans,
 			break;
 
 		if (ret == 1) {
+			ret = 0;
 			if (path->slots[0] == 0)
 				break;
 			path->slots[0]--;
@@ -1799,17 +1800,19 @@ static noinline int fixup_inode_link_counts(struct btrfs_trans_handle *trans,
 
 		ret = btrfs_del_item(trans, root, path);
 		if (ret)
-			goto out;
+			break;
 
 		btrfs_release_path(path);
 		inode = read_one_inode(root, key.offset);
-		if (!inode)
-			return -EIO;
+		if (!inode) {
+			ret = -EIO;
+			break;
+		}
 
 		ret = fixup_inode_link_count(trans, root, inode);
 		iput(inode);
 		if (ret)
-			goto out;
+			break;
 
 		/*
 		 * fixup on a directory may create new entries,
@@ -1818,8 +1821,6 @@ static noinline int fixup_inode_link_counts(struct btrfs_trans_handle *trans,
 		 */
 		key.offset = (u64)-1;
 	}
-	ret = 0;
-out:
 	btrfs_release_path(path);
 	return ret;
 }
-- 
cgit 1.4.1


From f96d44743a44e3332f75d23d2075bb8270900e1d Mon Sep 17 00:00:00 2001
From: Josef Bacik <josef@toxicpanda.com>
Date: Wed, 19 May 2021 11:26:25 -0400
Subject: btrfs: check error value from btrfs_update_inode in tree log

Error injection testing uncovered a case where we ended up with invalid
link counts on an inode.  This happened because we failed to notice an
error when updating the inode while replaying the tree log, and
committed the transaction with an invalid file system.

Fix this by checking the return value of btrfs_update_inode.  This
resolved the link count errors I was seeing, and we already properly
handle passing up the error values in these paths.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/tree-log.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

(limited to 'fs')

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 375c4642f480..e4820e88cba0 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -1574,7 +1574,9 @@ static noinline int add_inode_ref(struct btrfs_trans_handle *trans,
 			if (ret)
 				goto out;
 
-			btrfs_update_inode(trans, root, BTRFS_I(inode));
+			ret = btrfs_update_inode(trans, root, BTRFS_I(inode));
+			if (ret)
+				goto out;
 		}
 
 		ref_ptr = (unsigned long)(ref_ptr + ref_struct_size) + namelen;
@@ -1749,7 +1751,9 @@ static noinline int fixup_inode_link_count(struct btrfs_trans_handle *trans,
 
 	if (nlink != inode->i_nlink) {
 		set_nlink(inode, nlink);
-		btrfs_update_inode(trans, root, BTRFS_I(inode));
+		ret = btrfs_update_inode(trans, root, BTRFS_I(inode));
+		if (ret)
+			goto out;
 	}
 	BTRFS_I(inode)->index_cnt = (u64)-1;
 
-- 
cgit 1.4.1


From dc09ef3562726cd520c8338c1640872a60187af5 Mon Sep 17 00:00:00 2001
From: Josef Bacik <josef@toxicpanda.com>
Date: Wed, 19 May 2021 14:04:21 -0400
Subject: btrfs: abort in rename_exchange if we fail to insert the second ref

Error injection stress uncovered a problem where we'd leave a dangling
inode ref if we failed during a rename_exchange.  This happens because
we insert the inode ref for one side of the rename, and then for the
other side.  If this second inode ref insert fails we'll leave the first
one dangling and leave a corrupt file system behind.  Fix this by
aborting if we did the insert for the first inode ref.

CC: stable@vger.kernel.org # 4.9+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/inode.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

(limited to 'fs')

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e7de0c08b981..f5d32d85247a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9101,6 +9101,7 @@ static int btrfs_rename_exchange(struct inode *old_dir,
 	int ret2;
 	bool root_log_pinned = false;
 	bool dest_log_pinned = false;
+	bool need_abort = false;
 
 	/* we only allow rename subvolume link between subvolumes */
 	if (old_ino != BTRFS_FIRST_FREE_OBJECTID && root != dest)
@@ -9160,6 +9161,7 @@ static int btrfs_rename_exchange(struct inode *old_dir,
 					     old_idx);
 		if (ret)
 			goto out_fail;
+		need_abort = true;
 	}
 
 	/* And now for the dest. */
@@ -9175,8 +9177,11 @@ static int btrfs_rename_exchange(struct inode *old_dir,
 					     new_ino,
 					     btrfs_ino(BTRFS_I(old_dir)),
 					     new_idx);
-		if (ret)
+		if (ret) {
+			if (need_abort)
+				btrfs_abort_transaction(trans, ret);
 			goto out_fail;
+		}
 	}
 
 	/* Update inode version and ctime/mtime. */
-- 
cgit 1.4.1


From ea7036de0d36c4e6c9508f68789e9567d514333a Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana@suse.com>
Date: Mon, 24 May 2021 11:35:53 +0100
Subject: btrfs: fix fsync failure and transaction abort after writes to
 prealloc extents

When doing a series of partial writes to different ranges of preallocated
extents with transaction commits and fsyncs in between, we can end up with
a checksum items in a log tree. This causes an fsync to fail with -EIO and
abort the transaction, turning the filesystem to RO mode, when syncing the
log.

For this to happen, we need to have a full fsync of a file following one
or more fast fsyncs.

The following example reproduces the problem and explains how it happens:

  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt

  # Create our test file with 2 preallocated extents. Leave a 1M hole
  # between them to ensure that we get two file extent items that will
  # never be merged into a single one. The extents are contiguous on disk,
  # which will later result in the checksums for their data to be merged
  # into a single checksum item in the csums btree.
  #
  $ xfs_io -f \
           -c "falloc 0 1M" \
           -c "falloc 3M 3M" \
           /mnt/foobar

  # Now write to the second extent and leave only 1M of it as unwritten,
  # which corresponds to the file range [4M, 5M[.
  #
  # Then fsync the file to flush delalloc and to clear full sync flag from
  # the inode, so that a future fsync will use the fast code path.
  #
  # After the writeback triggered by the fsync we have 3 file extent items
  # that point to the second extent we previously allocated:
  #
  # 1) One file extent item of type BTRFS_FILE_EXTENT_REG that covers the
  #    file range [3M, 4M[
  #
  # 2) One file extent item of type BTRFS_FILE_EXTENT_PREALLOC that covers
  #    the file range [4M, 5M[
  #
  # 3) One file extent item of type BTRFS_FILE_EXTENT_REG that covers the
  #    file range [5M, 6M[
  #
  # All these file extent items have a generation of 6, which is the ID of
  # the transaction where they were created. The split of the original file
  # extent item is done at btrfs_mark_extent_written() when ordered extents
  # complete for the file ranges [3M, 4M[ and [5M, 6M[.
  #
  $ xfs_io -c "pwrite -S 0xab 3M 1M" \
           -c "pwrite -S 0xef 5M 1M" \
           -c "fsync" \
           /mnt/foobar

  # Commit the current transaction. This wipes out the log tree created by
  # the previous fsync.
  sync

  # Now write to the unwritten range of the second extent we allocated,
  # corresponding to the file range [4M, 5M[, and fsync the file, which
  # triggers the fast fsync code path.
  #
  # The fast fsync code path sees that there is a new extent map covering
  # the file range [4M, 5M[ and therefore it will log a checksum item
  # covering the range [1M, 2M[ of the second extent we allocated.
  #
  # Also, after the fsync finishes we no longer have the 3 file extent
  # items that pointed to 3 sections of the second extent we allocated.
  # Instead we end up with a single file extent item pointing to the whole
  # extent, with a type of BTRFS_FILE_EXTENT_REG and a generation of 7 (the
  # current transaction ID). This is due to the file extent item merging we
  # do when completing ordered extents into ranges that point to unwritten
  # (preallocated) extents. This merging is done at
  # btrfs_mark_extent_written().
  #
  $ xfs_io -c "pwrite -S 0xcd 4M 1M" \
           -c "fsync" \
           /mnt/foobar

  # Now do some write to our file outside the range of the second extent
  # that we allocated with fallocate() and truncate the file size from 6M
  # down to 5M.
  #
  # The truncate operation sets the full sync runtime flag on the inode,
  # forcing the next fsync to use the slow code path. It also changes the
  # length of the second file extent item so that it represents the file
  # range [3M, 5M[ and not the range [3M, 6M[ anymore.
  #
  # Finally fsync the file. Since this is a fsync that triggers the slow
  # code path, it will remove all items associated to the inode from the
  # log tree and then it will scan for file extent items in the
  # fs/subvolume tree that have a generation matching the current
  # transaction ID, which is 7. This means it will log 2 file extent
  # items:
  #
  # 1) One for the first extent we allocated, covering the file range
  #    [0, 1M[
  #
  # 2) Another for the first 2M of the second extent we allocated,
  #    covering the file range [3M, 5M[
  #
  # When logging the first file extent item we log a single checksum item
  # that has all the checksums for the entire extent.
  #
  # When logging the second file extent item, we also lookup for the
  # checksums that are associated with the range [0, 2M[ of the second
  # extent we allocated (file range [3M, 5M[), and then we log them with
  # btrfs_csum_file_blocks(). However that results in ending up with a log
  # that has two checksum items with ranges that overlap:
  #
  # 1) One for the range [1M, 2M[ of the second extent we allocated,
  #    corresponding to the file range [4M, 5M[, which we logged in the
  #    previous fsync that used the fast code path;
  #
  # 2) One for the ranges [0, 1M[ and [0, 2M[ of the first and second
  #    extents, respectively, corresponding to the files ranges [0, 1M[
  #    and [3M, 5M[. This one was added during this last fsync that uses
  #    the slow code path and overlaps with the previous one logged by
  #    the previous fast fsync.
  #
  # This happens because when logging the checksums for the second
  # extent, we notice they start at an offset that matches the end of the
  # checksums item that we logged for the first extent, and because both
  # extents are contiguous on disk, btrfs_csum_file_blocks() decides to
  # extend that existing checksums item and append the checksums for the
  # second extent to this item. The end result is we end up with two
  # checksum items in the log tree that have overlapping ranges, as
  # listed before, resulting in the fsync to fail with -EIO and aborting
  # the transaction, turning the filesystem into RO mode.
  #
  $ xfs_io -c "pwrite -S 0xff 0 1M" \
           -c "truncate 5M" \
           -c "fsync" \
           /mnt/foobar
  fsync: Input/output error

After running the example, dmesg/syslog shows the tree checker complained
about the checksum items with overlapping ranges and we aborted the
transaction:

  $ dmesg
  (...)
  [756289.557487] BTRFS critical (device sdc): corrupt leaf: root=18446744073709551610 block=30720000 slot=5, csum end range (16777216) goes beyond the start range (15728640) of the next csum item
  [756289.560583] BTRFS info (device sdc): leaf 30720000 gen 7 total ptrs 7 free space 11677 owner 18446744073709551610
  [756289.562435] BTRFS info (device sdc): refs 2 lock_owner 0 current 2303929
  [756289.563654] 	item 0 key (257 1 0) itemoff 16123 itemsize 160
  [756289.564649] 		inode generation 6 size 5242880 mode 100600
  [756289.565636] 	item 1 key (257 12 256) itemoff 16107 itemsize 16
  [756289.566694] 	item 2 key (257 108 0) itemoff 16054 itemsize 53
  [756289.567725] 		extent data disk bytenr 13631488 nr 1048576
  [756289.568697] 		extent data offset 0 nr 1048576 ram 1048576
  [756289.569689] 	item 3 key (257 108 1048576) itemoff 16001 itemsize 53
  [756289.570682] 		extent data disk bytenr 0 nr 0
  [756289.571363] 		extent data offset 0 nr 2097152 ram 2097152
  [756289.572213] 	item 4 key (257 108 3145728) itemoff 15948 itemsize 53
  [756289.573246] 		extent data disk bytenr 14680064 nr 3145728
  [756289.574121] 		extent data offset 0 nr 2097152 ram 3145728
  [756289.574993] 	item 5 key (18446744073709551606 128 13631488) itemoff 12876 itemsize 3072
  [756289.576113] 	item 6 key (18446744073709551606 128 15728640) itemoff 11852 itemsize 1024
  [756289.577286] BTRFS error (device sdc): block=30720000 write time tree block corruption detected
  [756289.578644] ------------[ cut here ]------------
  [756289.579376] WARNING: CPU: 0 PID: 2303929 at fs/btrfs/disk-io.c:465 csum_one_extent_buffer+0xed/0x100 [btrfs]
  [756289.580857] Modules linked in: btrfs dm_zero dm_dust loop dm_snapshot (...)
  [756289.591534] CPU: 0 PID: 2303929 Comm: xfs_io Tainted: G        W         5.12.0-rc8-btrfs-next-87 #1
  [756289.592580] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
  [756289.594161] RIP: 0010:csum_one_extent_buffer+0xed/0x100 [btrfs]
  [756289.595122] Code: 5d c3 e8 76 60 (...)
  [756289.597509] RSP: 0018:ffffb51b416cb898 EFLAGS: 00010282
  [756289.598142] RAX: 0000000000000000 RBX: fffff02b8a365bc0 RCX: 0000000000000000
  [756289.598970] RDX: 0000000000000000 RSI: ffffffffa9112421 RDI: 00000000ffffffff
  [756289.599798] RBP: ffffa06500880000 R08: 0000000000000000 R09: 0000000000000000
  [756289.600619] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
  [756289.601456] R13: ffffa0652b1d8980 R14: ffffa06500880000 R15: 0000000000000000
  [756289.602278] FS:  00007f08b23c9800(0000) GS:ffffa0682be00000(0000) knlGS:0000000000000000
  [756289.603217] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [756289.603892] CR2: 00005652f32d0138 CR3: 000000025d616003 CR4: 0000000000370ef0
  [756289.604725] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [756289.605563] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [756289.606400] Call Trace:
  [756289.606704]  btree_csum_one_bio+0x244/0x2b0 [btrfs]
  [756289.607313]  btrfs_submit_metadata_bio+0xb7/0x100 [btrfs]
  [756289.608040]  submit_one_bio+0x61/0x70 [btrfs]
  [756289.608587]  btree_write_cache_pages+0x587/0x610 [btrfs]
  [756289.609258]  ? free_debug_processing+0x1d5/0x240
  [756289.609812]  ? __module_address+0x28/0xf0
  [756289.610298]  ? lock_acquire+0x1a0/0x3e0
  [756289.610754]  ? lock_acquired+0x19f/0x430
  [756289.611220]  ? lock_acquire+0x1a0/0x3e0
  [756289.611675]  do_writepages+0x43/0xf0
  [756289.612101]  ? __filemap_fdatawrite_range+0xa4/0x100
  [756289.612800]  __filemap_fdatawrite_range+0xc5/0x100
  [756289.613393]  btrfs_write_marked_extents+0x68/0x160 [btrfs]
  [756289.614085]  btrfs_sync_log+0x21c/0xf20 [btrfs]
  [756289.614661]  ? finish_wait+0x90/0x90
  [756289.615096]  ? __mutex_unlock_slowpath+0x45/0x2a0
  [756289.615661]  ? btrfs_log_inode_parent+0x3c9/0xdc0 [btrfs]
  [756289.616338]  ? lock_acquire+0x1a0/0x3e0
  [756289.616801]  ? lock_acquired+0x19f/0x430
  [756289.617284]  ? lock_acquire+0x1a0/0x3e0
  [756289.617750]  ? lock_release+0x214/0x470
  [756289.618221]  ? lock_acquired+0x19f/0x430
  [756289.618704]  ? dput+0x20/0x4a0
  [756289.619079]  ? dput+0x20/0x4a0
  [756289.619452]  ? lockref_put_or_lock+0x9/0x30
  [756289.619969]  ? lock_release+0x214/0x470
  [756289.620445]  ? lock_release+0x214/0x470
  [756289.620924]  ? lock_release+0x214/0x470
  [756289.621415]  btrfs_sync_file+0x46a/0x5b0 [btrfs]
  [756289.621982]  do_fsync+0x38/0x70
  [756289.622395]  __x64_sys_fsync+0x10/0x20
  [756289.622907]  do_syscall_64+0x33/0x80
  [756289.623438]  entry_SYSCALL_64_after_hwframe+0x44/0xae
  [756289.624063] RIP: 0033:0x7f08b27fbb7b
  [756289.624588] Code: 0f 05 48 3d 00 (...)
  [756289.626760] RSP: 002b:00007ffe2583f940 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
  [756289.627639] RAX: ffffffffffffffda RBX: 00005652f32cd0f0 RCX: 00007f08b27fbb7b
  [756289.628464] RDX: 00005652f32cbca0 RSI: 00005652f32cd110 RDI: 0000000000000003
  [756289.629323] RBP: 00005652f32cd110 R08: 0000000000000000 R09: 00007f08b28c4be0
  [756289.630172] R10: fffffffffffff39a R11: 0000000000000293 R12: 0000000000000001
  [756289.631007] R13: 00005652f32cd0f0 R14: 0000000000000001 R15: 00005652f32cc480
  [756289.631819] irq event stamp: 0
  [756289.632188] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
  [756289.632911] hardirqs last disabled at (0): [<ffffffffa7e97c29>] copy_process+0x879/0x1cc0
  [756289.633893] softirqs last  enabled at (0): [<ffffffffa7e97c29>] copy_process+0x879/0x1cc0
  [756289.634871] softirqs last disabled at (0): [<0000000000000000>] 0x0
  [756289.635606] ---[ end trace 0a039fdc16ff3fef ]---
  [756289.636179] BTRFS: error (device sdc) in btrfs_sync_log:3136: errno=-5 IO failure
  [756289.637082] BTRFS info (device sdc): forced readonly

Having checksum items covering ranges that overlap is dangerous as in some
cases it can lead to having extent ranges for which we miss checksums
after log replay or getting the wrong checksum item. There were some fixes
in the past for bugs that resulted in this problem, and were explained and
fixed by the following commits:

  27b9a8122ff71a ("Btrfs: fix csum tree corruption, duplicate and outdated checksums")
  b84b8390d6009c ("Btrfs: fix file read corruption after extent cloning and fsync")
  40e046acbd2f36 ("Btrfs: fix missing data checksums after replaying a log tree")
  e289f03ea79bbc ("btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents")

Fix the issue by making btrfs_csum_file_blocks() taking into account the
start offset of the next checksum item when it decides to extend an
existing checksum item, so that it never extends the checksum to end at a
range that goes beyond the start range of the next checksum item.

When we can not access the next checksum item without releasing the path,
simply drop the optimization of extending the previous checksum item and
fallback to inserting a new checksum item - this happens rarely and the
optimization is not significant enough for a log tree in order to justify
the extra complexity, as it would only save a few bytes (the size of a
struct btrfs_item) of leaf space.

This behaviour is only needed when inserting into a log tree because
for the regular checksums tree we never have a case where we try to
insert a range of checksums that overlap with a range that was previously
inserted.

A test case for fstests will follow soon.

Reported-by: Philipp Fent <fent@in.tum.de>
Link: https://lore.kernel.org/linux-btrfs/93c4600e-5263-5cba-adf0-6f47526e7561@in.tum.de/
CC: stable@vger.kernel.org # 5.4+
Tested-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/file-item.c | 98 ++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 76 insertions(+), 22 deletions(-)

(limited to 'fs')

diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index a5a8dac334e8..441cee7fbb62 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -923,6 +923,37 @@ int btrfs_del_csums(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
+static int find_next_csum_offset(struct btrfs_root *root,
+				 struct btrfs_path *path,
+				 u64 *next_offset)
+{
+	const u32 nritems = btrfs_header_nritems(path->nodes[0]);
+	struct btrfs_key found_key;
+	int slot = path->slots[0] + 1;
+	int ret;
+
+	if (nritems == 0 || slot >= nritems) {
+		ret = btrfs_next_leaf(root, path);
+		if (ret < 0) {
+			return ret;
+		} else if (ret > 0) {
+			*next_offset = (u64)-1;
+			return 0;
+		}
+		slot = path->slots[0];
+	}
+
+	btrfs_item_key_to_cpu(path->nodes[0], &found_key, slot);
+
+	if (found_key.objectid != BTRFS_EXTENT_CSUM_OBJECTID ||
+	    found_key.type != BTRFS_EXTENT_CSUM_KEY)
+		*next_offset = (u64)-1;
+	else
+		*next_offset = found_key.offset;
+
+	return 0;
+}
+
 int btrfs_csum_file_blocks(struct btrfs_trans_handle *trans,
 			   struct btrfs_root *root,
 			   struct btrfs_ordered_sum *sums)
@@ -938,7 +969,6 @@ int btrfs_csum_file_blocks(struct btrfs_trans_handle *trans,
 	u64 total_bytes = 0;
 	u64 csum_offset;
 	u64 bytenr;
-	u32 nritems;
 	u32 ins_size;
 	int index = 0;
 	int found_next;
@@ -981,26 +1011,10 @@ again:
 			goto insert;
 		}
 	} else {
-		int slot = path->slots[0] + 1;
-		/* we didn't find a csum item, insert one */
-		nritems = btrfs_header_nritems(path->nodes[0]);
-		if (!nritems || (path->slots[0] >= nritems - 1)) {
-			ret = btrfs_next_leaf(root, path);
-			if (ret < 0) {
-				goto out;
-			} else if (ret > 0) {
-				found_next = 1;
-				goto insert;
-			}
-			slot = path->slots[0];
-		}
-		btrfs_item_key_to_cpu(path->nodes[0], &found_key, slot);
-		if (found_key.objectid != BTRFS_EXTENT_CSUM_OBJECTID ||
-		    found_key.type != BTRFS_EXTENT_CSUM_KEY) {
-			found_next = 1;
-			goto insert;
-		}
-		next_offset = found_key.offset;
+		/* We didn't find a csum item, insert one. */
+		ret = find_next_csum_offset(root, path, &next_offset);
+		if (ret < 0)
+			goto out;
 		found_next = 1;
 		goto insert;
 	}
@@ -1056,8 +1070,48 @@ extend_csum:
 		tmp = sums->len - total_bytes;
 		tmp >>= fs_info->sectorsize_bits;
 		WARN_ON(tmp < 1);
+		extend_nr = max_t(int, 1, tmp);
+
+		/*
+		 * A log tree can already have checksum items with a subset of
+		 * the checksums we are trying to log. This can happen after
+		 * doing a sequence of partial writes into prealloc extents and
+		 * fsyncs in between, with a full fsync logging a larger subrange
+		 * of an extent for which a previous fast fsync logged a smaller
+		 * subrange. And this happens in particular due to merging file
+		 * extent items when we complete an ordered extent for a range
+		 * covered by a prealloc extent - this is done at
+		 * btrfs_mark_extent_written().
+		 *
+		 * So if we try to extend the previous checksum item, which has
+		 * a range that ends at the start of the range we want to insert,
+		 * make sure we don't extend beyond the start offset of the next
+		 * checksum item. If we are at the last item in the leaf, then
+		 * forget the optimization of extending and add a new checksum
+		 * item - it is not worth the complexity of releasing the path,
+		 * getting the first key for the next leaf, repeat the btree
+		 * search, etc, because log trees are temporary anyway and it
+		 * would only save a few bytes of leaf space.
+		 */
+		if (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID) {
+			if (path->slots[0] + 1 >=
+			    btrfs_header_nritems(path->nodes[0])) {
+				ret = find_next_csum_offset(root, path, &next_offset);
+				if (ret < 0)
+					goto out;
+				found_next = 1;
+				goto insert;
+			}
+
+			ret = find_next_csum_offset(root, path, &next_offset);
+			if (ret < 0)
+				goto out;
+
+			tmp = (next_offset - bytenr) >> fs_info->sectorsize_bits;
+			if (tmp <= INT_MAX)
+				extend_nr = min_t(int, extend_nr, tmp);
+		}
 
-		extend_nr = max_t(int, 1, (int)tmp);
 		diff = (csum_offset + extend_nr) * csum_size;
 		diff = min(diff,
 			   MAX_CSUM_ITEMS(fs_info, csum_size) * csum_size);
-- 
cgit 1.4.1


From 76a6d5cd74479e7ec8a7f9a29bce63d5549b6b2e Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana@suse.com>
Date: Tue, 25 May 2021 11:05:28 +0100
Subject: btrfs: fix deadlock when cloning inline extents and low on available
 space

There are a few cases where cloning an inline extent requires copying data
into a page of the destination inode. For these cases we are allocating
the required data and metadata space while holding a leaf locked. This can
result in a deadlock when we are low on available space because allocating
the space may flush delalloc and two deadlock scenarios can happen:

1) When starting writeback for an inode with a very small dirty range that
   fits in an inline extent, we deadlock during the writeback when trying
   to insert the inline extent, at cow_file_range_inline(), if the extent
   is going to be located in the leaf for which we are already holding a
   read lock;

2) After successfully starting writeback, for non-inline extent cases,
   the async reclaim thread will hang waiting for an ordered extent to
   complete if the ordered extent completion needs to modify the leaf
   for which the clone task is holding a read lock (for adding or
   replacing file extent items). So the cloning task will wait forever
   on the async reclaim thread to make progress, which in turn is
   waiting for the ordered extent completion which in turn is waiting
   to acquire a write lock on the same leaf.

So fix this by making sure we release the path (and therefore the leaf)
every time we need to copy the inline extent's data into a page of the
destination inode, as by that time we do not need to have the leaf locked.

Fixes: 05a5a7621ce66c ("Btrfs: implement full reflink support for inline extents")
CC: stable@vger.kernel.org # 5.10+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/reflink.c | 38 ++++++++++++++++++++++----------------
 1 file changed, 22 insertions(+), 16 deletions(-)

(limited to 'fs')

diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
index 06682128d8fa..58ddc7ed9e84 100644
--- a/fs/btrfs/reflink.c
+++ b/fs/btrfs/reflink.c
@@ -207,10 +207,7 @@ static int clone_copy_inline_extent(struct inode *dst,
 			 * inline extent's data to the page.
 			 */
 			ASSERT(key.offset > 0);
-			ret = copy_inline_to_page(BTRFS_I(dst), new_key->offset,
-						  inline_data, size, datal,
-						  comp_type);
-			goto out;
+			goto copy_to_page;
 		}
 	} else if (i_size_read(dst) <= datal) {
 		struct btrfs_file_extent_item *ei;
@@ -226,13 +223,10 @@ static int clone_copy_inline_extent(struct inode *dst,
 		    BTRFS_FILE_EXTENT_INLINE)
 			goto copy_inline_extent;
 
-		ret = copy_inline_to_page(BTRFS_I(dst), new_key->offset,
-					  inline_data, size, datal, comp_type);
-		goto out;
+		goto copy_to_page;
 	}
 
 copy_inline_extent:
-	ret = 0;
 	/*
 	 * We have no extent items, or we have an extent at offset 0 which may
 	 * or may not be inlined. All these cases are dealt the same way.
@@ -244,11 +238,13 @@ copy_inline_extent:
 		 * clone. Deal with all these cases by copying the inline extent
 		 * data into the respective page at the destination inode.
 		 */
-		ret = copy_inline_to_page(BTRFS_I(dst), new_key->offset,
-					  inline_data, size, datal, comp_type);
-		goto out;
+		goto copy_to_page;
 	}
 
+	/*
+	 * Release path before starting a new transaction so we don't hold locks
+	 * that would confuse lockdep.
+	 */
 	btrfs_release_path(path);
 	/*
 	 * If we end up here it means were copy the inline extent into a leaf
@@ -285,11 +281,6 @@ copy_inline_extent:
 	ret = btrfs_inode_set_file_extent_range(BTRFS_I(dst), 0, aligned_end);
 out:
 	if (!ret && !trans) {
-		/*
-		 * Release path before starting a new transaction so we don't
-		 * hold locks that would confuse lockdep.
-		 */
-		btrfs_release_path(path);
 		/*
 		 * No transaction here means we copied the inline extent into a
 		 * page of the destination inode.
@@ -310,6 +301,21 @@ out:
 		*trans_out = trans;
 
 	return ret;
+
+copy_to_page:
+	/*
+	 * Release our path because we don't need it anymore and also because
+	 * copy_inline_to_page() needs to reserve data and metadata, which may
+	 * need to flush delalloc when we are low on available space and
+	 * therefore cause a deadlock if writeback of an inline extent needs to
+	 * write to the same leaf or an ordered extent completion needs to write
+	 * to the same leaf.
+	 */
+	btrfs_release_path(path);
+
+	ret = copy_inline_to_page(BTRFS_I(dst), new_key->offset,
+				  inline_data, size, datal, comp_type);
+	goto out;
 }
 
 /**
-- 
cgit 1.4.1