Generate random directories/files given number of files and depth
I'd like to profile some VCS software, and to do so I want to generate a set of random files, in randomly arranged directories. I'm writing the script in Python, but my question, briefly, is: how do I generate a random directory tree with an average number of sub-directories per directory and some broad distribution of files per directory?
Clarification: I'm not comparing different VCS repo formats (eg. SVN vs Git vs Hg), but profiling software that deals with SVN (and eventually other) working copies and repos.
The constraints I'd like are to specify the total number of files (call it 'N', probably ~10k-100k) and the maximum depth of the directory structure ('L', probably 2-10). I don't care how many directories are generated at each level, and I don't want to end up with 1 file per dir, or 100k all in one dir.
The distribution is something I'm not sure about, since I don't know whether VCSs (SVN in particular) would perform better or worse with a very uniform structure or a very skewed structure. Nonetheless, it would be nice if I could come up with an algorithm that didn't "even out" for large numbers.
My first thoughts were: generate the directory tree using some method, and then uniformly populate the tree with files (treating each dir equally, with no regard to nesting). My back-of-the-envelope calculation tells me that if there are 'L' levels, with 'D' subdirs per dir, and about sqrt(N) files per dir, then there will be about D^L dirs, so N =~ sqrt(N)*(D^L) => D =~ N^(1/(2L)). So now I have an approximate value for 'D'; how can I generate the tree, and how do I populate the files?
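A rough sketch of what I have in mind so far (the names and numbers below are just placeholders, not a final design):

import random
from pathlib import Path

def build_skeleton(base, depth, branching):
    # Full tree with 'branching' subdirs per dir, 'depth' levels deep.
    dirs = [Path(base)]
    frontier = [Path(base)]
    for _ in range(depth):
        next_frontier = []
        for d in frontier:
            for i in range(branching):
                sub = d / f"dir{i}"
                sub.mkdir(parents=True, exist_ok=True)
                dirs.append(sub)
                next_frontier.append(sub)
        frontier = next_frontier
    return dirs

def populate(dirs, n_files):
    # Scatter files uniformly over all dirs, ignoring nesting depth.
    for i in range(n_files):
        (random.choice(dirs) / f"file{i}.txt").touch()

N, L = 10_000, 4
D = max(2, round(N ** (1 / (2 * L))))  # D =~ N^(1/(2L)) from the estimate above
populate(build_skeleton("testtree", L, D), N)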
I'd be grateful just for some pointers to good resources on algorithms I could use. My searching only found pretty applets/flash.
Why not download some real open source repos and use those?
Have you thought about what goes into the files? Is that random data too?
I recently wrote a small Python package, randomfiletree, which generates a random file/directory structure. The code and manual are on https://github.com/klieret/randomfiletree.
The algorithm traverses an existing file tree and creates a number of files and directories in each subfolder, with the counts drawn from a Gaussian distribution of a given mean and width. This process is then repeated.
It basically uses something like this:
import os
import random
import string
from pathlib import Path

def random_string(length=8):
    # Simple stand-in for the package's random-name helper.
    return "".join(random.choices(string.ascii_lowercase, k=length))

def create_random_tree(basedir, nfiles=2, nfolders=1, repeat=1,
                       maxdepth=None, sigma_folders=1, sigma_files=1):
    """
    Create a random set of files and folders by repeatedly walking through the
    current tree and creating random files or subfolders (the number of files
    and folders created is chosen from a Gaussian distribution).

    Args:
        basedir: Directory to create files and folders in
        nfiles: Average number of files to create
        nfolders: Average number of folders to create
        repeat: Walk this often through the directory tree to create new
            subdirectories and files
        maxdepth: Maximum depth to descend into current file tree. If None,
            infinity.
        sigma_folders: Spread of number of folders
        sigma_files: Spread of number of files

    Returns:
        (List of dirs, List of files), all as pathlib.Path objects.
    """
    alldirs = []
    allfiles = []
    for i in range(repeat):
        for root, dirs, files in os.walk(str(basedir)):
            # The number of new folders/files per directory is Gaussian;
            # negative draws simply create nothing.
            for _ in range(int(random.gauss(nfolders, sigma_folders))):
                p = Path(root) / random_string()
                p.mkdir(exist_ok=True)
                alldirs.append(p)
            for _ in range(int(random.gauss(nfiles, sigma_files))):
                p = Path(root) / random_string()
                p.touch(exist_ok=True)
                allfiles.append(p)
            depth = os.path.relpath(root, str(basedir)).count(os.sep)
            if maxdepth and depth >= maxdepth - 1:
                del dirs[:]  # prune: stop os.walk from descending further
    alldirs = list(set(alldirs))
    allfiles = list(set(allfiles))
    return alldirs, allfiles
This is a pretty quick-and-dirty approach, but one could also develop this module further if there is interest.
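For example (the base directory and parameter values here are just made up for illustration), it can be called like this:

from pathlib import Path

base = Path("/tmp/random-tree")
base.mkdir(exist_ok=True)
dirs, files = create_random_tree(base, nfiles=10, nfolders=3,
                                 repeat=4, maxdepth=5,
                                 sigma_folders=1, sigma_files=3)
print(f"created {len(dirs)} directories and {len(files)} files")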
Your question is fairly long and involved, but I think it boils down to asking for a random number generator with certain statistical properties.
If you don't like Python's built-in random number generator, you might look at some of the other statistical packages on PyPI, or, if you want something a little more heavy-duty, perhaps the Python bindings for the GNU Scientific Library:
http://sourceforge.net/projects/pygsl/
http://www.gnu.org/software/gsl/
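As a minimal illustration of that idea (standard library only; the distributions and parameters are arbitrary), the per-directory file count could be drawn from either a roughly even or a heavy-tailed distribution, depending on how skewed you want the tree to be:

import random

random.seed(0)  # fixed seed so profiling runs are repeatable

# Roughly even tree: counts cluster around the mean.
even_count = max(0, int(random.gauss(30, 10)))

# Skewed tree: most dirs small, a few very large (heavy tail).
skewed_count = int(5 * random.paretovariate(1.5))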
I had similar needs, so I created a Rust crate that reproducibly generates a random directory structure, with a focus on being maximally performant (it can create 1.5 million files in under 2 seconds on my machine). Output is Gaussian and is configured with a target number of total files, a max depth, and a target file-to-dir ratio (i.e. N files per directory).
File Tree Fuzzer: https://github.com/SUPERCILEX/ftzz