
Thread: Anybody know a good way to manage 1.5 million+ files?

  1. #26
    Xtreme Member | Join Date: Aug 2009 | Location: Somewhere | Posts: 220
    If you're looking for something that fits the scope of multiple different formats, not all of which are supported by one or two simple applications, you could:

    1. look into an enterprise-level data management application (probably really expensive)
    2. have someone (or yourself) build you a custom application (less expensive, but it may take a while)
    3. use good file management methods and build a better, forward-moving file structure

    You could use something from the Adobe suite to catalog part of your dataset, but I'm sure it won't get all of it, and the MySQL back-end (I would assume that's what it is from the discussion here so far) will crap out eventually because it's most likely not tuned for huge data sets. To have something like this happen automagically, you will probably have to invest quite a bit.

    I could build a database that handles the metadata (keywords, file type, date created, date edited, etc.) and points to your files, but the metadata would have to be entered manually. You could, however, group files and enter the same metadata for a whole group at once, so it wouldn't take as long; then, for the more notable files, you could add something to those individually.
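
    For what it's worth, a minimal sketch of that kind of metadata database, assuming SQLite as the back-end (the table and column names are just placeholders):

    import sqlite3

    # Hypothetical schema: one row of metadata per file, pointing back to the real path on the array.
    conn = sqlite3.connect("catalog.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS files (
            path     TEXT PRIMARY KEY,  -- full path on the array
            filetype TEXT,              -- extension or format name
            created  TEXT,              -- timestamps stored as ISO-8601 strings
            edited   TEXT,
            keywords TEXT               -- entered manually, per file or per group
        )
    """)

    def tag_group(paths, filetype, keywords):
        # Apply the same metadata to a whole group of files in one pass.
        conn.executemany(
            "INSERT OR REPLACE INTO files (path, filetype, keywords) VALUES (?, ?, ?)",
            [(p, filetype, keywords) for p in paths],
        )
        conn.commit()

    # Searching then becomes an indexed query instead of walking the array:
    # conn.execute("SELECT path FROM files WHERE keywords LIKE ?", ("%turbulence%",)).fetchall()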

    The best thing I can suggest, though, is to build a file structure that breaks these files down into meaningful sub-groups, then maintain it and stick to it. At this point, this project is probably going to take either a lot of time or a lot of money.

    Desktop (and Cruncher #1):AMD Phenom II x6 1090T @ 4.03Ghz | Gigabyte MA790FXT-UD5P (F8n) | G.Skill Ripjaws 2x4GB @ 9-9-9-24-1T 1680MHz | Radeon HD 5850 & 5830 | Silverstone ST75F 750W | 60GB OCZ Vertex 2 3x1TB WD RE3 (Raid 5) | Lian Li PC-A70B
    Cruncher (#2): Intel Core I7 920 (stock) | EVGA X58 SLI | G.Skill Pi 3x2GB | 2x Radeon HD 6870 | Corsair HX850 | Some Janky HDD | LanCool PC-K7
    Cruncher (#3): Intel Core I7 2600k (stock) | BioStar TH67+ | G.Skill Ripjaws 2x4GB | Antec Basiq550 | Some Janky HDD | Antec 300
    Server: Intel Atom | 2x2GB DDR3 | ThermalRight TR2-430 | Some Less Janky Laptop HDD | Fractal Core-1000
    Mobile: Lenovo X120e

  2. #27
    Registered User | Join Date: Aug 2012 | Posts: 70
    Quote Originally Posted by alpha754293 View Post
    Anybody know of a good way to manage 1.5 million+ files?
    They all vary in size and distribution and type and format. There's no standard naming convention.
    Hopefully there'd be a way to automatically index what's on the array, register the file into a database, and make searching for a file a lot faster.
    Thoughts/suggestions?
    1.5 million files isn't that much for a modern filesystem these days (unless you have all of them in one folder).

    Two easy ways to reduce storage needs and speed up access:

    1) Access: you might take a look at "Everything". This little tool indexes only the directory structure of your Windows server/client and is blazingly fast. It has been my favorite tool for a couple of million files for a few years now. Highly recommended.

    2) Storage reduction: it's not uncommon for 1.5 million files to contain many duplicates, soaking up storage space (and backup space). As a one-time effort, you could look into deduplication tools to delete all the duplicate files. BTW, Windows Server 2012 has deduplication built in. Be aware that it is cluster-based rather than file-based and, depending on the file types stored, gives back 30-70% of your consumed space. If you're interested, a little tool is available to evaluate what the savings with your files could be: the DeDupEval Tool.
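
    If the built-in options don't fit, this is roughly what a file-level (as opposed to cluster-based) duplicate scan boils down to; a minimal Python sketch for illustration, not the DeDupEval Tool itself:

    import hashlib
    import os
    from collections import defaultdict

    def find_duplicates(root):
        # Group files by content hash; any group with more than one path is a set of duplicates.
        by_hash = defaultdict(list)
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                digest = hashlib.sha256()
                try:
                    with open(path, "rb") as f:
                        for chunk in iter(lambda: f.read(1 << 20), b""):  # read end-to-end, 1 MiB at a time
                            digest.update(chunk)
                except OSError:
                    continue  # skip files that can't be read
                by_hash[digest.hexdigest()].append(path)
        return {h: paths for h, paths in by_hash.items() if len(paths) > 1}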

    Depending on your disk subsystem, expect a couple of hours of runtime, as it needs to read all files end-to-end for proper analysis.

    Andy

  3. #28
    Xtreme Member | Join Date: Apr 2006 | Location: Ontario | Posts: 349
    Thanks! I'll have to take a look at that.

    *edit*

    I just downloaded the Windows Server 2012 RC and am giving it a shot. I was also reading from the link that you sent that it processes about 100 GB/hour, and right now I've used about 9.7 TiB, which means it'll need at least 97 hours or so to go through the entire array of data that I've got already.

    But it seems like it's a nifty little tool.

    However, quite often I'm not really duplication-bound, because - for example - when I'm running multiple simulations where I'm trying things out, it's actually quite easy for me to exceed the path-length limitations of most filesystems.

    And sometimes, the changes can be very subtle.

    But we'll have to see how that goes. Thanks for the info though. Never knew about it.

    *edit*
    9 hours 46 minutes later, and it's only processed 646 GB out of 9.7 TB (about 6.5% or so). This is definitely MUCH slower than the 100 GB/hour. (The array itself is able to read/write at some 115 MB/s, but the poor little old 2.8 GHz Xeon might be struggling to keep up with it.)
    Last edited by alpha754293; 09-08-2012 at 03:21 AM.
    flow man:
    du/dt + u dot del u = - del P / rho + v vector_Laplacian u
    {\partial\mathbf{u}\over\partial t}+\mathbf{u}\cdot\nabla\mathbf{u} = -{\nabla P\over\rho} + \nu\nabla^2\mathbf{u}

  4. #29
    Xtreme Member | Join Date: Mar 2012 | Location: Brisbane, Australia | Posts: 182
    Put it on the biggest hard drive you can find and forget about it? Free space on the old disk, and you can still access it when you need to.. :P
    Work/Gaming: i7-950|GB X58A-UD7|12GB Trident BBSE/XMS3|460GTX|WD 1TB BLK|Pioneer DVDRW|CM HAF-X|Win 7 Pro 64 bit|U2711|HX850|G500|G510
    Quote Originally Posted by hiwa View Post
    I protect my gskills like how i protect my balls
    Heatware: jimba86
    Bench:Custom Giga-bench|Win 7/XP SP3|WD 36GB Raptor|Dell 22|AX1200|MS intellimouse
    Bench 1:i7 930|water 2.0 performer|Gigabyte X58A-OC | 4GB corsair 1866 CL7|GTX 295 Quad SLI
    Bench 2:E8500|NHD14|P45-UD3P(2nd PCIEx16 slot broken.. )|2GB Corsair 8888 Cl4|GTX 260 SOC
    (Bench 3: In Progress) 4770K|F1EE|Gigabyte Z97X-SOC Force| 4GB GTX1 /8GB Gskill TridentX 2666CL11 ney pro|5870 x3 on KPC tek9 slim 5.0/7.0
    Bench 4 E8600|NHD14|REX|2GB Corsair 1800 Cl7|Asus GTX 280
    Bench 5 (TBC): FX?|990FX-UD7|Gskill Flare 2000 Cl7|6970
    Server/renderbox: G3258|TT Water 2.0|Gigabyte Z87 Sniper M5|8GB Gskill Sniper 2133|gigabyte 5750|FD NODE 804

  5. #30
    Xtreme Member | Join Date: Apr 2006 | Location: Ontario | Posts: 349
    Quote Originally Posted by jimba86 View Post
    Put it on the biggest hard drive you can find and forget about it? Free space on the old disk, and you can still access it when you need to.. :P
    Well... unfortunately, a) I don't have access to prototype 4 TB and 6 TB drives, and b) I have a friend who does, and he's having a SERIOUS issue with them right now (every time the head sweeps over, it flips the polarity. Bad. Apparently, he also said that's a firmware issue. *shrug*).

    So, other than that, I'm already using 3 TB drives in the array.

    But it isn't SPACE that's the issue. It's "how fast can I get the system to locate/find/retrieve something for me?" (And it's not I/O either.) If you've ever had to manually sort through > 1 million files, you'll know/learn very quickly what I'm talking about. Consider that just building the PLAINTEXT index of all of the files on the array takes just under 3.5 minutes, the resulting file is already 131 MB (!), and looking for a file in that is already considerably faster than actually searching the array itself.
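
    For reference, that kind of plain-text index is just a directory walk dumped into a single file; a rough Python equivalent, with the mount point and search term as placeholders:

    import os

    # Dump every file path on the array into one flat text file.
    with open("array_index.txt", "w", encoding="utf-8") as out:
        for dirpath, _dirs, files in os.walk(r"D:\array"):
            for name in files:
                out.write(os.path.join(dirpath, name) + "\n")

    # Searching the index is then a linear scan of one file instead of hitting the array:
    with open("array_index.txt", "r", encoding="utf-8") as idx:
        hits = [line.rstrip("\n") for line in idx if "run_042" in line]
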
    flow man:
    du/dt + u dot del u = - del P / rho + v vector_Laplacian u
    {\partial\mathbf{u}\over\partial t}+\mathbf{u}\cdot\nabla\mathbf{u} = -{\nabla P\over\rho} + \nu\nabla^2\mathbf{u}

  6. #31
    Xtreme Member | Join Date: Apr 2006 | Location: Ontario | Posts: 349
    Currently running DeDupEval at about 50 GB/hour, which means it'll take an estimated 194 hours to finish going through the 9.7 TB of data already on the array.
    flow man:
    du/dt + u dot del u = - del P / rho + v vector_Laplacian u
    {\partial\mathbf{u}\over\partial t}+\mathbf{u}\cdot\nabla\mathbf{u} = -{\nabla P\over\rho} + \nu\nabla^2\mathbf{u}

  7. #32
    I am Xtreme | Join Date: Jan 2006 | Location: Australia! :) | Posts: 6,096
    Quote Originally Posted by Andreas View Post
    1.5 million files isn't that much for a modern filesystem these days (unless you have all of them in one folder).

    Two easy ways to reduce storage needs and speed up access:

    1) Access: you might take a look at "Everything". This little tool indexes only the directory structure of your Windows server/client and is blazingly fast. It has been my favorite tool for a couple of million files for a few years now. Highly recommended.

    2) Storage reduction: it's not uncommon for 1.5 million files to contain many duplicates, soaking up storage space (and backup space). As a one-time effort, you could look into deduplication tools to delete all the duplicate files. BTW, Windows Server 2012 has deduplication built in. Be aware that it is cluster-based rather than file-based and, depending on the file types stored, gives back 30-70% of your consumed space. If you're interested, a little tool is available to evaluate what the savings with your files could be: the DeDupEval Tool.

    Depending on your disk subsystem, expect a couple of hours of runtime, as it needs to read all files end-to-end for proper analysis.

    Andy
    Where is the download link for that dedup tool?
    DNA = Design Not Accident
    DNA = Darwin Not Accurate

    heatware / ebay
    HARDWARE I only own Xeons, Extreme Editions & Lian Li's
    https://prism-break.org/

  8. #33
    Xtreme Member | Join Date: Apr 2006 | Location: Ontario | Posts: 349
    Quote Originally Posted by tiro_uspsss View Post
    Where is the download link for that dedup tool?
    There isn't one.
    flow man:
    du/dt + u dot del u = - del P / rho + v vector_Laplacian u
    {\partial\mathbf{u}\over\partial t}+\mathbf{u}\cdot\nabla\mathbf{u} = -{\nabla P\over\rho} + \nu\nabla^2\mathbf{u}

  9. #34
    Xtreme Member | Join Date: Aug 2010 | Location: perth, west oz | Posts: 252
    I think you have to sign up and download WinSrv2012 to be able to play with this.

    BTW, when will it be released?

    Henrik

  10. #35
    I am Xtreme | Join Date: Jan 2006 | Location: Australia! :) | Posts: 6,096
    Quote Originally Posted by tived View Post
    I think you have to sign up and download WinSrv2012 to be able to play with this.

    BTW, when will it be released?

    Henrik
    oh I see, it's part of WS12

    WS12 has already been released!
    DNA = Design Not Accident
    DNA = Darwin Not Accurate

    heatware / ebay
    HARDWARE I only own Xeons, Extreme Editions & Lian Li's
    https://prism-break.org/

  11. #36
    Xtreme Member | Join Date: Aug 2010 | Location: perth, west oz | Posts: 252
    Between us we make a great team ;-)
    thanks

    Quote Originally Posted by tiro_uspsss View Post
    oh I see, it's part of WS12

    WS12 has already been released!
    Henrik
    A Dane Down Under

    Current systems:
    EVGA Classified SR-2
    Lian Li PC-V2120 Black, Antec 1200 PSU,
    2x X5650 (20x 190 APPROX 4.2GHZ), CPU Cooling: Noctua NH-D14
    (48gb) 6x 8Gb Kingston ECC 1333 KVR1333D3D4R9S/8GI, Boot: 8R0 SAMSUNG 830 129GB ARECA 1882IX-4GB CACHE - Scratch disk: 2x6R0 INTEL 520 120GB's, 2x IBM M1015/LSI 9240-8i, Asus GTX-580

    ASUS P5W64 WS PRO, QX-6700 (Extreme Quadcore) 2.66Ghz, 4x2GB HyberX, various hard drives and GT-7600

    Tyan S2895 K8WE 2x 285 Opteron's 8x 2gb DDR400 1x nVidia GT-8800 2x 1 TB Samsung F1 (not very nice) Chenbro SR-107 case

    Monitors: NEC 2690v2 & Dell 2405 & 2x ASUS VE246H

  12. #37
    Xtreme Member | Join Date: Apr 2006 | Location: Ontario | Posts: 349
    It took nearly a week to scan my 9.7 TB of data; it said I would save about 40% with dedup. I'm waiting to back up all the data before going ahead with it.
    flow man:
    du/dt + u dot del u = - del P / rho + v vector_Laplacian u
    {\partial\mathbf{u}\over\partial t}+\mathbf{u}\cdot\nabla\mathbf{u} = -{\nabla P\over\rho} + \nu\nabla^2\mathbf{u}

  13. #38
    I am Xtreme | Join Date: Jan 2006 | Location: Australia! :) | Posts: 6,096
    Quote Originally Posted by Andreas View Post
    1.5 million files isn't that much for a modern filesystem these days (unless you have all of them in one folder).

    Two easy ways to reduce storage needs and speed up access:

    1) Access: you might take a look at "Everything". This little tool indexes only the directory structure of your Windows server/client and is blazingly fast. It has been my favorite tool for a couple of million files for a few years now. Highly recommended.

    2) Storage reduction: it's not uncommon for 1.5 million files to contain many duplicates, soaking up storage space (and backup space). As a one-time effort, you could look into deduplication tools to delete all the duplicate files. BTW, Windows Server 2012 has deduplication built in. Be aware that it is cluster-based rather than file-based and, depending on the file types stored, gives back 30-70% of your consumed space. If you're interested, a little tool is available to evaluate what the savings with your files could be: the DeDupEval Tool.

    Depending on your disk subsystem, expect a couple of hours of runtime, as it needs to read all files end-to-end for proper analysis.

    Andy
    That dedup tool... can it scan/run on HDDs that are network shares?
    DNA = Design Not Accident
    DNA = Darwin Not Accurate

    heatware / ebay
    HARDWARE I only own Xeons, Extreme Editions & Lian Li's
    https://prism-break.org/

  14. #39
    Xtreme Member | Join Date: Apr 2006 | Location: Ontario | Posts: 349
    Quote Originally Posted by tiro_uspsss View Post
    That dedup tool... can it scan/run on HDDs that are network shares?
    It probably can if you have it mapped as a network drive. I wouldn't recommend it, though, because it has to read everything block-by-block, so unless you're using some kind of high-speed interconnect like InfiniBand or 10 Gbps Ethernet, it's probably not worth it.

    Scanning the 27 TB RAID5 array took nearly a week on the local system. You can do the math to figure out what your best possible speed would be if you tried to do the same scan over the network (unless you have no other choice).
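
    A back-of-the-envelope version of that math, with assumed link speeds just to show where the bottleneck sits (real scans run far slower because of per-file overhead):

    # Best-case scan time = data to read divided by the slowest link in the chain.
    data_gb        = 9700.0   # data actually on the array, in GB
    local_disk_mbs = 115.0    # local sequential read, MB/s
    gige_mbs       = 110.0    # roughly what 1 Gbps Ethernet manages in practice, MB/s

    for label, speed in [("local", local_disk_mbs), ("over GbE", min(local_disk_mbs, gige_mbs))]:
        hours = data_gb * 1000.0 / speed / 3600.0
        print(f"{label}: ~{hours:.0f} h at {speed:.0f} MB/s, best case")
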
    flow man:
    du/dt + u dot del u = - del P / rho + v vector_Laplacian u
    {\partial\mathbf{u}\over\partial t}+\mathbf{u}\cdot\nabla\mathbf{u} = -{\nabla P\over\rho} + \nu\nabla^2\mathbf{u}

  15. #40
    Xtreme Enthusiast | Join Date: Dec 2008 | Posts: 522
    I have 1.3 million+ files on my array. I just keep everything organized in folders, and then folders within folders. I have never had any major issues finding what I am looking for. You may want to check out Lammer Context Menu if you want to database your files, as it makes it easy to batch-rename stuff. I used it to clean out my video library for XBMC before creating a naming standard that I now use for all new file names.
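
    If you'd rather script it than use a shell extension, a naming-standard pass is only a few lines of Python; the normalization rule below is purely an example, not XBMC's or Lammer Context Menu's convention:

    import os
    import re

    def normalize(name):
        # Example convention: lowercase, with spaces and odd characters collapsed to underscores.
        base, ext = os.path.splitext(name)
        base = re.sub(r"[^\w\-]+", "_", base.strip().lower()).strip("_")
        return base + ext.lower()

    def batch_rename(folder, dry_run=True):
        # Preview the renames first; only touch the files once dry_run is set to False.
        for name in os.listdir(folder):
            new = normalize(name)
            if new != name:
                print(f"{name} -> {new}")
                if not dry_run:
                    os.rename(os.path.join(folder, name), os.path.join(folder, new))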

  16. #41
    Xtreme Member | Join Date: Apr 2006 | Location: Ontario | Posts: 349
    Quote Originally Posted by zeroibis View Post
    I have 1.3 million+ files on my array. I just keep everything organized in folders, and then folders within folders. I have never had any major issues finding what I am looking for. You may want to check out Lammer Context Menu if you want to database your files, as it makes it easy to batch-rename stuff. I used it to clean out my video library for XBMC before creating a naming standard that I now use for all new file names.
    What's XBMC?

    Well, the problem with me creating folders within folders within folders (etc.) is that I have hit the path-length limit before. (It doesn't take much for me to run into it, actually.) So... that's why that doesn't really work all that well for me. (And I've hit it even on ZFS on Solaris.) So... yeah...
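
    For what it's worth, a quick way to check how close an existing tree gets to a given path-length limit (260 characters is the classic Windows MAX_PATH; the root path here is a placeholder):

    import os

    MAX_PATH = 260  # classic Windows limit; adjust for the filesystem in question

    def long_paths(root, limit=MAX_PATH):
        # Yield (length, path) for every file whose full path exceeds the limit.
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                if len(path) > limit:
                    yield len(path), path

    for length, path in long_paths(r"D:\array"):
        print(length, path)
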
    flow man:
    du/dt + u dot del u = - del P / rho + v vector_Laplacian u
    {\partial\mathbf{u}\over\partial t}+\mathbf{u}\cdot\nabla\mathbf{u} = -{\nabla P\over\rho} + \nu\nabla^2\mathbf{u}

  17. #42
    Xtreme Enthusiast | Join Date: Dec 2008 | Posts: 522
    XBMC is an open-source media system for a lot of different OSes. It is mainly used for HTPCs to serve media libraries.

    The thing I was referring to is a different program called Lammer Context Menu, which can be useful if you want to apply things like batch-renaming operations to a lot of files to aid in organization.

    Is there any reason why you need to have such a deep folder structure? Or are you trying to create a folder maze to hide :banana::banana::banana::banana:... lol

  18. #43
    I am Xtreme | Join Date: Jan 2006 | Location: Australia! :) | Posts: 6,096
    OK, I have a rig up & running with WS2012... I can't seem to get the dedup tool up & running. Any help?
    DNA = Design Not Accident
    DNA = Darwin Not Accurate

    heatware / ebay
    HARDWARE I only own Xeons, Extreme Editions & Lian Li's
    https://prism-break.org/

  19. #44
    Xtreme Member | Join Date: Apr 2006 | Location: Ontario | Posts: 349
    Quote Originally Posted by tiro_uspsss View Post
    OK, I have a rig up & running with WS2012... I can't seem to get the dedup tool up & running. Any help?
    Follow this procedure for getting data deduplication up and running...

    http://technet.microsoft.com/en-us/l.../hh831700.aspx
    flow man:
    du/dt + u dot del u = - del P / rho + v vector_Laplacian u
    {\partial\mathbf{u}\over\partial t}+\mathbf{u}\cdot\nabla\mathbf{u} = -{\nabla P\over\rho} + \nu\nabla^2\mathbf{u}
