summaryrefslogtreecommitdiff
path: root/fs/f2fs/node.h
AgeCommit message (Collapse)Author
2016-11-23f2fs: split free nid listChao Yu
During free nid allocation, in order to do preallocation, we will tag free nid entry as allocated one and still leave it in free nid list, for other allocators who want to grab free nids, it needs to traverse the free nid list for lookup. It becomes overhead in scenario of allocating free nid intensively by multithreads. This patch splits free nid list to two list: {free,alloc}_nid_list, to keep free nids and preallocated free nids separately, after that, traverse latency will be gone, besides split nid_cnt for separate statistic. Additionally, introduce __insert_nid_to_list and __remove_nid_from_list for cleanup. Signed-off-by: Chao Yu <yuchao0@huawei.com> [Jaegeuk Kim: modify f2fs_bug_on to avoid needless branches] Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-23f2fs: fix sparse warningsEric Biggers
f2fs contained a number of endianness conversion bugs. Also, one function should have been 'static'. Found with sparse by running 'make C=2 CF=-D__CHECK_ENDIAN__ fs/f2fs/' Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-30f2fs: introduce cp_lock to protect updating of ckpt_flagsChao Yu
This patch introduces spinlock to protect updating process of ckpt_flags field in struct f2fs_checkpoint, it avoids incorrectly updating in race condition. Signed-off-by: Chao Yu <yuchao0@huawei.com> [Jaegeuk Kim: add __is_set_ckpt_flags likewise __set_ckpt_flags] Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-30f2fs: use crc and cp version to determine roll-forward recoveryJaegeuk Kim
Previously, we used cp_version only to detect recoverable dnodes. In order to avoid same garbage cp_version, we needed to truncate the next dnode during checkpoint, resulting in additional discard or data write. If we can distinguish this by using crc in addition to cp_version, we can remove this overhead. There is backward compatibility concern where it changes node_footer layout. So, this patch introduces a new checkpoint flag, CP_CRC_RECOVERY_FLAG, to detect new layout. New layout will be activated only when this flag is set. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-07-06f2fs: produce more nids and reduce readahead natsJaegeuk Kim
The readahead nat pages are more likely to be reclaimed quickly, so it'd better to gather more free nids in advance. And, let's keep some free nids as much as possible. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-06-07f2fs: control not to exceed # of cached nat entriesJaegeuk Kim
This is to avoid cache entry management overhead including radix tree. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-06-07f2fs: fix wrong percentageJaegeuk Kim
This should be 1%, 10MB / 1GB. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-02-22f2fs: use wait_for_stable_page to avoid contentionJaegeuk Kim
In write_begin, if storage supports stable_page, we don't need to wait for writeback to update its contents. This patch introduces to use wait_for_stable_page instead of wait_on_page_writeback. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-02-22f2fs: avoid multiple node page writes due to inline_dataJaegeuk Kim
The sceanrio is: 1. create fully node blocks 2. flush node blocks 3. write inline_data for all the node blocks again 4. flush node blocks redundantly So, this patch tries to flush inline_data when flushing node blocks. Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-02-22f2fs: export dirty_nats_ratio in sysfsChao Yu
This patch exports a new sysfs entry 'dirty_nat_ratio' to control threshold of dirty nat entries, if current ratio exceeds configured threshold, checkpoint will be triggered in f2fs_balance_fs_bg for flushing dirty nats. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-02-22f2fs: flush dirty nat entries when exceeding thresholdChao Yu
When testing f2fs with xfstest, generic/251 is stuck for long time, the case uses below serials to obtain fresh released space in device, in order to prepare for following fstrim test. 1. rm -rf /mnt/dir 2. mkdir /mnt/dir/ 3. cp -axT `pwd`/ /mnt/dir/ 4. goto 1 During preparing step, all nat entries will be cached in nat cache, most of them are dirty entries with invalid blkaddr, which means nodes related to these entries have been truncated, and they could be reused after the dirty entries been checkpointed. However, there was no checkpoint been triggered, so nid allocators (e.g. mkdir, creat) will run into long journey of iterating all NAT pages, looking for free nids in alloc_nid->build_free_nids. Here, in f2fs_balance_fs_bg we give another chance to do checkpoint to flush nat entries for reusing them in free nid cache when dirty entry count exceeds 10% of max count. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-01-08f2fs: avoid unnecessary f2fs_balance_fs callsJaegeuk Kim
Only when node page is newly dirtied, it needs to check whether we need to do f2fs_gc. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-12-04f2fs: use sbi->blocks_per_seg to avoid unnecessary calculationChao Yu
Use sbi->blocks_per_seg directly to avoid unnecessary calculation when using 1 << sbi->log_blocks_per_seg. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-12f2fs: export ra_nid_pages to sysfsChao Yu
After finishing building free nid cache, we will try to readahead asynchronously 4 more pages for the next reloading, the count of readahead nid pages is fixed. In some case, like SMR drive, read less sectors with fixed count each time we trigger RA may be low efficient, since we will face high seeking overhead, so we'd better let user to configure this parameter from sysfs in specific workload. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-12Revert "f2fs: do not skip dentry block writes"Jaegeuk Kim
The periodic checkpoint can resolve the previous issue. So, now we can use this again to improve the reported performance regression: https://lkml.org/lkml/2015/10/8/20 This reverts commit 15bec0ff5a9ba6d203178fa8772259df6207942a.
2015-10-09f2fs: do not skip dentry block writesJaegeuk Kim
Previously, we skip dentry block writes when wbc is SYNC_NONE with no memory pressure and the number of dirty pages is pretty small. But, we didn't skip for normal data writes, which gives us not much big impact on overall performance. Moreover, by skipping some data writes, kworker falls into infinite loop to try to write blocks, when many dir inodes have only one dentry block. So, this patch removes skipping data writes. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-05-28f2fs: move existing definitions into f2fs.hJaegeuk Kim
This patch moves some inode-related definitions from node.h to f2fs.h to add new features. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-03-03f2fs: introduce infra macro and data structure of rb-tree extent cacheChao Yu
Introduce infra macro and data structure for rb-tree based extent cache: Macros: * EXT_TREE_VEC_SIZE: indicate vector size for gang lookup in extent tree. * F2FS_MIN_EXTENT_LEN: indicate minimum length of extent managed in cache. * EXTENT_CACHE_SHRINK_NUMBER: indicate number of extent in cache will be shrunk. Basic data structures for extent cache: * struct extent_tree: extent tree entry per inode. * struct extent_node: extent info node linked in extent tree. Besides, adding new extent cache related fields in f2fs_sb_info. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Changman Lee <cm224.lee@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09f2fs: free radix_tree_nodes used by nat_set entriesJaegeuk Kim
In the normal case, the radix_tree_nodes are freed successfully. But, when cp_error was detected, we should destroy them forcefully. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09f2fs: fix missing cold bit during recoveryJaegeuk Kim
In do_recover_data, we find and update previous node pages after updating its new block addresses. After then, we call fill_node_footer without reset field, we erase its cold bit so that this new cold node block is written to wrong log area. This patch fixes not to miss its old flag. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09f2fs: merge two uchar variable in struct node_info to reduce memory costChao Yu
This patch moves one member of struct nat_entry: _flag_ to struct node_info, so _version_ in struct node_info and _flag_ which are unsigned char type will merge to one 32-bit space in register/memory. So the size of nat_entry will be reduced from 28 bytes to 24 bytes (for 64-bit machine, reduce its size from 40 bytes to 32 bytes) and then slab memory using by f2fs will be reduced. changes from v2: o update description of memory usage gain for 64-bit machine suggested by Changman Lee. changes from v1: o introduce inline copy_node_info() to copy valid data from node info suggested by Jaegeuk Kim, it can avoid bug. Reviewed-by: Changman Lee <cm224.lee@samsung.com> Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09f2fs: change atomic and volatile write policiesJaegeuk Kim
This patch adds two new ioctls to release inmemory pages grabbed by atomic writes. o f2fs_ioc_abort_volatile_write - If transaction was failed, all the grabbed pages and data should be written. o f2fs_ioc_release_volatile_write - This is to enhance the performance of PERSIST mode in sqlite. In order to avoid huge memory consumption which causes OOM, this patch changes volatile writes to use normal dirty pages, instead blocked flushing to the disk as long as system does not suffer from memory pressure. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-06f2fs: control the memory footprint used by ino entriesJaegeuk Kim
This patch adds to control the memory footprint used by ino entries. This will conduct best effort, not strictly. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-03f2fs: introduce f2fs_change_bit to simplify the change bit logicGu Zheng
Introduce f2fs_change_bit to simplify the change bit logic in function set_to_next_nat{sit}. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-10-05f2fs: remove unused return valueJaegeuk Kim
Don't return any value without any usage. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-30f2fs: refactor flush_nat_entries to remove costly reorganizing opsJaegeuk Kim
Previously, f2fs tries to reorganize the dirty nat entries into multiple sets according to its nid ranges. This can improve the flushing nat pages, however, if there are a lot of cached nat entries, it becomes a bottleneck. This patch introduces a new set management flow by removing dirty nat list and adding a series of set operations when the nat entry becomes dirty. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-23f2fs: fix conditions to remain recovery information in f2fs_sync_fileJaegeuk Kim
This patch revisited whole the recovery information during the f2fs_sync_file. In this patch, there are three information to make a decision. a) IS_CHECKPOINTED, /* is it checkpointed before? */ b) HAS_FSYNCED_INODE, /* is the inode fsynced before? */ c) HAS_LAST_FSYNC, /* has the latest node fsync mark? */ And, the scenarios for our rule are based on: [Term] F: fsync_mark, D: dentry_mark 1. inode(x) | CP | inode(x) | dnode(F) 2. inode(x) | CP | inode(F) | dnode(F) 3. inode(x) | CP | dnode(F) | inode(x) | inode(F) 4. inode(x) | CP | dnode(F) | inode(F) 5. CP | inode(x) | dnode(F) | inode(DF) 6. CP | inode(DF) | dnode(F) 7. CP | dnode(F) | inode(DF) 8. CP | dnode(F) | inode(x) | inode(DF) For example, #3, the three conditions should be changed as follows. inode(x) | CP | dnode(F) | inode(x) | inode(F) a) x o o o o b) x x x x o c) x o o x o If f2fs_sync_file stops ------^, it should write inode(F) --------------^ So, the need_inode_block_update should return true, since c) get_nat_flag(e, HAS_LAST_FSYNC), is false. For example, #8, CP | alloc | dnode(F) | inode(x) | inode(DF) a) o x x x x b) x x x o c) o o x o If f2fs_sync_file stops -------^, it should write inode(DF) --------------^ Note that, the roll-forward policy should follow this rule, which means, if there are any missing blocks, we doesn't need to recover that inode. Signed-off-by: Huang Ying <ying.huang@intel.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-23f2fs: introduce a flag to represent each nat entry informationJaegeuk Kim
This patch introduces a flag in the nat entry structure to merge various information such as checkpointed and fsync_done marks. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-16f2fs: fix a race condition in next_free_nidHuang Ying
The nm_i->fcnt checking is executed before spin_lock, so if another thread delete the last free_nid from the list, the wrong nid may be gotten. So fix the race condition by moving the nm_i->fnct checking into spin_lock. Signed-off-by: Huang, Ying <ying.huang@intel.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-03f2fs: introduce F2FS_I_SB, F2FS_M_SB, and F2FS_P_SBJaegeuk Kim
This patch adds three inline functions to clean up dirty casting codes. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09f2fs: refactor flush_nat_entries codes for reducing NAT writesChao Yu
Although building NAT journal in cursum reduce the read/write work for NAT block, but previous design leave us lower performance when write checkpoint frequently for these cases: 1. if journal in cursum has already full, it's a bit of waste that we flush all nat entries to page for persistence, but not to cache any entries. 2. if journal in cursum is not full, we fill nat entries to journal util journal is full, then flush the left dirty entries to disk without merge journaled entries, so these journaled entries may be flushed to disk at next checkpoint but lost chance to flushed last time. In this patch we merge dirty entries located in same NAT block to nat entry set, and linked all set to list, sorted ascending order by entries' count of set. Later we flush entries in sparse set into journal as many as we can, and then flush merged entries to disk. In this way we can not only gain in performance, but also save lifetime of flash device. In my testing environment, it shows this patch can help to reduce NAT block writes obviously. In hard disk test case: cost time of fsstress is stablely reduced by about 5%. 1. virtual machine + hard disk: fsstress -p 20 -n 200 -l 5 node num cp count nodes/cp based 4599.6 1803.0 2.551 patched 2714.6 1829.6 1.483 2. virtual machine + 32g micro SD card: fsstress -p 20 -n 200 -l 1 -w -f chown=0 -f creat=4 -f dwrite=0 -f fdatasync=4 -f fsync=4 -f link=0 -f mkdir=4 -f mknod=4 -f rename=5 -f rmdir=5 -f symlink=0 -f truncate=4 -f unlink=5 -f write=0 -S node num cp count nodes/cp based 84.5 43.7 1.933 patched 49.2 40.0 1.23 Our latency of merging op shows not bad when handling extreme case like: merging a great number of dirty nats: latency(ns) dirty nat count 3089219 24922 5129423 27422 4000250 24523 change log from v1: o fix wrong logic in add_nat_entry when grab a new nat entry set. o swith to create slab cache in create_node_manager_caches. o use GFP_ATOMIC instead of GFP_NOFS to avoid potential long latency. change log from v2: o make comment position more appropriate suggested by Jaegeuk Kim. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-05-07f2fs: fix checkpatch warningZhang Zhen
fix the following checkpatch warning: WARNING: do {} while (0) macros should not be semicolon terminated Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07f2fs: split grab_cache_page and wait_on_page_writeback for node pagesJaegeuk Kim
This patch splits grab_cache_page_write_begin into grab_cache_page and wait_on_page_writeback for node pages. This patch intends to enhance the latency to get node pages by alleviating unnecessary wait_on_page_writeback. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07f2fs: adjust free mem size to flush dentry blocksJaegeuk Kim
If so many dirty dentry blocks are cached, not reached to the flush condition, we should fall into livelock in balance_dirty_pages. So, let's consider the mem size for the condition. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07f2fs: introduce raw_nat_from_node_info() to simplfy codesChao Yu
This patch introduce raw_nat_from_node_info() to simplfy some codes, and also use exist function node_info_from_raw_nat() to do the same job. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-03-20f2fs: skip unnecessary node writes during fsyncJaegeuk Kim
If multiple redundant fsync calls are triggered, we don't need to write its node pages with fsync mark continuously. So, this patch adds FI_NEED_FSYNC to track whether the latest node block is written with the fsync mark or not. If the mark was set, a new fsync doesn't need to write a node block. Otherwise, we should do a new node block with the mark for roll-forward recovery. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-03-20f2fs: remove unnecessary thresholdJaegeuk Kim
The NM_WOUT_THRESHOLD is now obsolete since f2fs starts to control on a basis of the memory footprint. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-03-20f2fs: throttle the memory footprint with a sysfs entryJaegeuk Kim
This patch introduces ram_thresh, a sysfs entry, which controls the memory footprint used by the free nid list and the nat cache. Previously, the free nid list was controlled by MAX_FREE_NIDS, while the nat cache was managed by NM_WOUT_THRESHOLD. However, this approach cannot be applied dynamically according to the system. So, this patch adds ram_thresh that users can specify the threshold, which is in order of 1 / 1024. For example, if the total ram size is 4GB and the value is set to 10 by default, f2fs tries to control the number of free nids and nat caches not to consume over 10 * (4GB / 1024) = 10MB. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-03-18f2fs: introduce f2fs_has_xattr_block for better readabilityChao Yu
This patch introduces a help function f2fs_has_xattr_block for better readability. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-02-24f2fs: fix to mark the checkpointed nat entry correctlyJaegeuk Kim
The nat cache entry maintains a status whether it is checkpointed or not. So, if a new cache entry is loaded from the last checkpoint, nat_entry->checkpointed should be true. If the cache entry is modified as being dirty, nat_entry->checkpoint should be false. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23f2fs: update several commentsChao Yu
Update several comments: 1. use f2fs_{un}lock_op install of mutex_{un}lock_op. 2. update comment of get_data_block(). 3. update description of node offset. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-08-09f2fs: fix the use of XATTR_NODE_OFFSETJaegeuk Kim
This patch fixes the use of XATTR_NODE_OFFSET. o The offset should not use several MSB bits which are used by marking node blocks. o IS_DNODE should handle XATTR_NODE_OFFSET to avoid potential abnormality during the fsync call. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-07-30f2fs: introduce help function F2FS_NODE()Gu Zheng
Introduce help function F2FS_NODE() to simplify the conversion of node_page to f2fs_node. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-06-14f2fs: recover wrong pino after checkpoint during fsyncJaegeuk Kim
If a file is linked, f2fs loose its parent inode number so that fsync calls for the linked file should do checkpoint all the time. But, if we can recover its parent inode number after the checkpoint, we can adjust roll-forward mechanism for the further fsync calls, which is able to improve the fsync performance significatly. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28f2fs: optimize several routines in node.hNamjae Jeon
There are various functions with common code which could be separated out to make common routines. So, made new routines and in order to retain the same call path and no major changes, written some macros to access those routines. Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-04-09f2fs: fix the logic of IS_DNODE()Zhihui Zhang
If (ofs % (NIDS_PER_BLOCK + 1) == 0), the node is an indirect node block. Signed-off-by: Zhihui Zhang <zzhsuny@gmail.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-04-03f2fs: remove redundant lock_page callsJaegeuk Kim
In get_node_page, we do not need to call lock_page all the time. If the node page is cached as uptodate, 1. grab_cache_page locks the page, 2. read_node_page unlocks the page, and 3. lock_page is called for further process. Let's avoid this. Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-03-27f2fs: fix to give correct parent inode number for roll forwardJaegeuk Kim
When we recover fsync'ed data after power-off-recovery, we should guarantee that any parent inode number should be correct for each direct inode blocks. So, let's make the following rules. - The fsync should do checkpoint to all the inodes that were experienced hard links. - So, the only normal files can be recovered by roll-forward. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2012-12-11f2fs: adjust kernel coding styleJaegeuk Kim
As pointed out by Randy Dunlap, this patch removes all usage of "/**" for comment blocks. Instead, just use "/*". Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2012-12-11f2fs: fix endian conversion bugs reported by sparseJaegeuk Kim
This patch should resolve the bugs reported by the sparse tool. Initial reports were written by "kbuild test robot" managed by fengguang.wu. In my local machines, I've tested also by running: > make C=2 CF="-D__CHECK_ENDIAN__" Accordingly, I've found lots of warnings and bugs related to the endian conversion. And I've fixed all at this moment. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>