Anybody know a good way to manage 1.5 million+ files?



alpha754293
07-03-2012, 11:03 AM
Anybody know of a good way to manage 1.5 million+ files?

They all vary in size and distribution and type and format. There's no standard naming convention.

Hopefully there'd be a way to automatically index what's on the array, register the file into a database, and make searching for a file a lot faster.

Thoughts/suggestions?

P.S. I don't code/program, so hopefully there's a solution that requires as little of that as possible (although if there isn't one, I'd be interested to hear what some of my other options are and how much coding/programming would be involved to get it off the ground).

Thanks.

johnw
07-03-2012, 09:15 PM
Ideally, the data should be converted to a database. That many files is not the most efficient way to store data.

But if that is not an option, then I'd start by creating about 1000 - 1200 subdirectories, and then dividing the files among the subdirectories. Many filesystems slow down with a million files in a directory, which is why it is a good idea to subdivide them.

As for indexing, most OSs have a built-in mechanism for that. Certainly linux and Windows do, and I guess MacOS does also. What OS are you using?
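For what it's worth, here is a rough batch sketch of that kind of subdivision - the D:\share path is made up, and this buckets by the leading character of each file name rather than aiming for an even 1,000-way split:

@echo off
rem Rough sketch only: move each loose file under D:\share into a subfolder
rem named after the first character of its file name, so no single directory
rem ends up holding hundreds of thousands of entries.
setlocal enabledelayedexpansion
for %%F in (D:\share\*) do (
    set "name=%%~nxF"
    set "bucket=!name:~0,1!"
    if not exist "D:\share\!bucket!" mkdir "D:\share\!bucket!"
    move "%%F" "D:\share\!bucket!" >nul
)
endlocal

Using the first two characters instead (!name:~0,2!) would give a finer split, closer to the 1,000-1,200 directories suggested above.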

stevecs
07-04-2012, 04:21 AM
Well, with your comment on not coding, it gets harder. A lot can be handled by logical breakups in structure (a well-structured file system layout), for example creating different file systems or mount points for each general type of data (much easier to do under Unix-like OSs): backups; documents (sub-divided by type or content); graphics (static images); multimedia; music/sound; web archival; or whatever. Then you can apply different criteria to each, which also helps in breaking up the metadata block (think MFT), which is single-threaded on many file systems (exceptions being the likes of XFS and ZFS); that's a performance gain, since you can have multiple writes going on at the same time.

Anyway, if you're looking at INDEXing the data, not just FS storage of it, the above helps a little for human access. Historically, I've also used indexers (Harvest when it came out, but it's been left in the dust-bin). Apache has the Lucene project, which tries to fill the free niche, as most other systems have gone commercial.

Then there are specific indexers for specific types of files (for example, I use Adobe Bridge & embedded EXIF keywords for the photo data of tens of thousands of my photographs). MP3 utilities exist to index MP3 tag data as well. Granted, both of those could also be indexed by Lucene or others, but that may require specific gatherers (code that understands the particular file types and collects the data to be sent to the indexer).

The biggest issue for 'deep' contextual linkages is having a structured metadata format for your data. If you just throw your file cabinet at something randomly, you won't really get any good results. Plan ahead, think about what you need/want to do with the data, and build from there, also trying to anticipate needs that you don't have yet. ;)

I'm re-doing a lot of that here (~5 million files) with more information (mainly working on my photo data now), and getting it all tagged correctly so it can be indexed is time-consuming.

alfaunits
07-04-2012, 11:24 AM
Obviously, the most efficient way is to just delete everything ;)

That aside... what type of indexing and search do you want? File names only, or the data itself? For data, the Windows Indexing Service is probably your best bet.
If you only need file-name searches to be faster... you can write (yes, you can...) an extremely simple batch script that lists every bloody file on your storage into a single list file. Then pull that into either Excel (easiest) or MS SQL Server (probably much faster and easier to update later on), and write an extremely simple SQL command to search it.

Yes, yes, you're not a programmer... but such a batch file is simply:
FOR /R %%i in (*) do echo "%%i" >> MyListFile.dat
(or maybe two commands for something better)
And the SQL command for such a simple DB is extremely easy. (Errr... I don't speak SQL these days, but I remember it's just a one-line SELECT.)
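Something along those lines, for illustration only - the server, database, table, and column names below are invented, and it assumes the list file has been bulk-loaded into SQL Server first (e.g. with BULK INSERT or the import wizard):

rem Batch-file sketch; paths and names are hypothetical.
rem 1) Dump every file path on D: into a flat list, one path per line:
dir /s /b D:\ > MyListFile.dat

rem 2) Once MyListFile.dat has been loaded into a table (say FileList, with a
rem    single FilePath column), a search really is a one-line SELECT via
rem    sqlcmd.  (%% is just the batch-file escape for a literal %.)
sqlcmd -S localhost -d FileIndex -Q "SELECT FilePath FROM FileList WHERE FilePath LIKE '%%report%%'"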

Indexing that much file data on the other hand... depending on the file formats, I don't see much of a point.

canthearu
07-04-2012, 10:17 PM
Obviously, the most efficient way is to just delete everything ;)

Yeah, I was thinking that too .... but it seemed such a stupid answer to give :)

In retrospect though, you really do need to filter out what you need and get rid of the rest. It is obviously unsorted because you haven't cared much about the data in the past, and have somehow managed to muddle through and find data when you needed it so far.

Anvil
07-05-2012, 03:10 AM
Scanning a volume and appending to a database is part of ASU, it's just not publicly available yet.

At the basic level it tells the size of files that are found.

This is a scan of my C drive. (X25-V)

[attachment 128122]

Next, it will group files by Extension.

One can search using "SQL" and one can of course export results to Excel.

I've been using it for testing in general but might expand on functionality, if there is a need for such a feature.

alpha754293
07-05-2012, 11:01 AM
Ideally, the data should be converted to a database. That many files is not the most efficient way to store data.

But if that is not an option, then I'd start by creating about 1000 - 1200 subdirectories, and then dividing the files among the subdirectories. Many filesystems slow down with a million files in a directory, which is why it is a good idea to subdivide them.

As for indexing, most OSs have a built-in mechanism for that. Certainly linux and Windows do, and I guess MacOS does also. What OS are you using?

Well... I'm trying to make it easier to find files as the filesystem gets fuller and fuller. (It's a 27 TB RAID 5 array; I don't know how many files I have on it right now, but the last time I checked, at 6 TB, I already had 1-1.5 million files.) So my dummy method of indexing the contents of the array was to use the following commands:
find /share > share.txt

and then cat share.txt | grep part-of-filename-i'm-looking-for.

I would update the index every week (it takes about 6-10 minutes), and that helped speed up looking for files a little bit. But like I said, it's not exactly the most sophisticated way of doing it.
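For a rough Windows-native equivalent of that workflow (the server below turns out to be running Windows), the same index can be built with dir /s /b, searched with findstr's case-insensitive literal mode, and rebuilt unattended with a scheduled task - paths and task names here are only examples:

rem Rebuild the index, one full path per line:
dir /s /b D:\share > D:\index\share.txt

rem Case-insensitive literal-string search of the index:
findstr /i /c:"part-of-filename" D:\index\share.txt

rem Rebuild every Sunday at 03:00, assuming the dir command above is saved
rem as C:\scripts\rebuild_index.cmd:
schtasks /create /tn "WeeklyShareIndex" /tr "C:\scripts\rebuild_index.cmd" /sc weekly /d SUN /st 03:00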

Now that I have 27 TB at my disposal, the problem's only going to get worse. A LOT worse.

They're not all in one directory, I don't know how many directories I've got.

OS right now is Windows HPC Server 2008.


Well, with your comment on not coding, it gets harder. A lot can be handled by logical breakups in structure (a well-structured file system layout), for example creating different file systems or mount points for each general type of data (much easier to do under Unix-like OSs): backups; documents (sub-divided by type or content); graphics (static images); multimedia; music/sound; web archival; or whatever. Then you can apply different criteria to each, which also helps in breaking up the metadata block (think MFT), which is single-threaded on many file systems (exceptions being the likes of XFS and ZFS); that's a performance gain, since you can have multiple writes going on at the same time.

Anyway, if you're looking at INDEXing the data, not just FS storage of it, the above helps a little for human access. Historically, I've also used indexers (Harvest when it came out, but it's been left in the dust-bin). Apache has the Lucene project, which tries to fill the free niche, as most other systems have gone commercial.

Then there are specific indexers for specific types of files (for example, I use Adobe Bridge & embedded EXIF keywords for the photo data of tens of thousands of my photographs). MP3 utilities exist to index MP3 tag data as well. Granted, both of those could also be indexed by Lucene or others, but that may require specific gatherers (code that understands the particular file types and collects the data to be sent to the indexer).

The biggest issue for 'deep' contextual linkages is having a structured metadata format for your data. If you just throw your file cabinet at something randomly, you won't really get any good results. Plan ahead, think about what you need/want to do with the data, and build from there, also trying to anticipate needs that you don't have yet. ;)

I'm re-doing a lot of that here (~5 million files) with more information (mainly working on my photo data now), and getting it all tagged correctly so it can be indexed is time-consuming.

Well, the problem is that it's a hodgepodge of files. In terms of total size, there's probably as many large files (installers, videos, etc...) as there are lots of little files (MP3s, individual result files, photos).

And not all of the content was originally generated by me.


Obviously, the most efficient way is to just delete everything ;)

That aside... what type of indexing and search do you want? File names only, or the data itself? For data, the Windows Indexing Service is probably your best bet.
If you only need file-name searches to be faster... you can write (yes, you can...) an extremely simple batch script that lists every bloody file on your storage into a single list file. Then pull that into either Excel (easiest) or MS SQL Server (probably much faster and easier to update later on), and write an extremely simple SQL command to search it.

Yes, yes, you're not a programmer... but such a batch file is simply:
FOR /R %%i in (*) do echo "%%i" >> MyListFile.dat
(or maybe two commands for something better)
And the SQL command for such a simple DB is extremely easy. (Errr... I don't speak SQL these days, but I remember it's just a one-line SELECT.)

Indexing that much file data on the other hand... depending on the file formats, I don't see much of a point.

The idea of indexing it is really to facilitate searching for it at some point down the road.

At one point, I was trying to use Picasa (desktop, not web) to index some 750,000 pictures. Unfortunately, the database file that Picasa writes ended up being bigger than the hard drive on the system that was doing the indexing, so... it ran into problems. That, and I couldn't move/point the Picasa database file to the location where I wanted to put it.


Yeah, I was thinking that too .... but it seemed such a stupid answer to give :)

In retrospect though, you really do need to filter out what you need and get rid of the rest. It is obviously unsorted because you haven't cared much about the data in the past, and have somehow managed to muddle through and find data when you needed it so far.

No data gets deleted. Or very rarely.


Scanning a volume and appending to a database is part of ASU, it's just not publicly available yet.

At the basic level it tells the size of files that are found.

This is a scan of my C drive. (X25-V)

[attachment 128122]

Next, it will group files by Extension.

One can search using "SQL" and one can of course export results to Excel.

I've been using it for testing in general but might expand on functionality, if there is a need for such a feature.

My thinking was that if there were some way for the system to automatically register files into a database as they are copied/generated/produced on the filesystem, then when I want to run a search, I could run it against the database rather than directly on the array.

And I'm hoping that this type of indexing won't end up bogging down the system so much that it kills the effective transfer rates (a la WHS dynamic load balancing).

I'm also guessing/betting on the idea that this problem isn't unique and that companies have had to deal with managing multi-million, multi-format, multi-size, and multi-distribution files before so I'm curious to see what companies do when faced with this problem, and how I can do it for 1/10th to 1/100th of the cost (with as little programming as possible).
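One low-tech way to approximate that, without a real database service, is to append only recently modified files to the index on a nightly schedule - sketched here with made-up paths, and with robocopy used purely in list-only mode so nothing is actually copied:

rem List files modified within the last day (/MAXAGE:1) without copying (/L),
rem full paths, no headers or summaries, and append them to a delta index:
robocopy D:\share C:\dummy /L /E /MAXAGE:1 /FP /NS /NC /NDL /NJH /NJS /NP >> D:\index\share_new.txt

rem Run the above (saved as C:\scripts\append_index.cmd) every night at 02:00:
schtasks /create /tn "NightlyIndexAppend" /tr "C:\scripts\append_index.cmd" /sc daily /st 02:00

The delta list could then be bulk-loaded into a database table, so the catalog stays roughly in step with the array without rescanning all 27 TB.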

alfaunits
07-06-2012, 04:02 PM
It is possible to add a file to an index as it gets created, remove it as it is deleted, and update it as it is renamed. Since the file name can be hashed to a key before it is looked up in the DB, it is also possible to split the data into several DBs and make indexing faster.
I'd bet some synchronization software already does something like this. Not the free ones, I guess :(
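Plain batch can't hash file names easily, so as a crude stand-in for that idea the master list could simply be split into several smaller per-extension lists (file names below are hypothetical), and only the relevant one searched:

findstr /i /c:".jpg" MyListFile.dat > list_jpg.txt
findstr /i /c:".mp3" MyListFile.dat > list_mp3.txt
findstr /i /c:".trn" MyListFile.dat > list_trn.txt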

alpha754293
07-07-2012, 06:03 AM
Here's the breakdown by quantity

[attachment 128161]

*edit*
I'm already at 1.765 million files and I've only used 9 TB out of 27 TB. And I know that I have just under 37k files that take up 5.1 TB.

tived
07-10-2012, 06:49 PM
I don't know if this will help
http://www.phaseone.com/en/Image-Software/Media-Pro.aspx

but it is essentially for managing an image collection, and I think it can also handle some documents.

Henrik

stevecs
07-11-2012, 09:41 AM
Actually, for photo/image/media organization, Bridge (comes with Photoshop) is pretty good. With CS6 it's 64-bit (finally), which is good for large images (I have a lot of medium- and large-format images, so this is really helpful). The problem is the index, which only does a decent job up to about 500,000 entries or so; you can create multiple catalogues to get around that, and I've also pushed it above 500,000 in the past, but the MySQL interface under the covers is not really optimized for it. (I was really hoping they would have increased it to at least 5,000,000 when going 64-bit. :( )

tived
07-11-2012, 06:13 PM
Hi Stevecs,

Bridge is not a catalog app or a database; if you want to go that route, you are better off looking at Adobe Lightroom 4, which has database capabilities.
Don't get me wrong, Bridge is great - I use it daily with files of tens of gigabytes (each!) - but with something like PhaseOne Media Pro (previously iView Media Pro / Expression Media) you can work on your catalog offline, and it is built specifically for media files, image and video.
Look up digital asset management, or DAM.

Henrik

stevecs
07-11-2012, 07:12 PM
Hi Stevecs,

Bridge is not a catalog app or a database; if you want to go that route, you are better off looking at Adobe Lightroom 4, which has database capabilities.
Don't get me wrong, Bridge is great - I use it daily with files of tens of gigabytes (each!) - but with something like PhaseOne Media Pro (previously iView Media Pro / Expression Media) you can work on your catalog offline, and it is built specifically for media files, image and video.
Look up digital asset management, or DAM.

Henrik

Actually, Bridge is a database (it uses MySQL). I have LR 4.1 but find it pretty useless, at least in my workflow, when I have to deal with hundreds of thousands of images. I'd say LR is probably more useful when you have smaller collections (maybe <10,000 images?), but not very large ones. I'm mainly using EXIF/IPTC, and that's indexed by Bridge's MySQL backend. Most images here are archival, so I have to keep multiple generations cross-linked (64-bit TIFF/DNGs as the raws (RGBI), then gen 1-x working images as 48-bit TIFF, then several gen x+y deliverable images). All Adobe products have real problems with 64-bit DNGs/TIFFs except Bridge (the others strip off the infrared channels on open without even notifying the user - I've opened several cases with Adobe on it over the years, to no avail). It seems they're really geared toward digital source images, not 100-year-old raw scans and restoration work.

Your comment on working on catalogues offline is also (at least with regard to my workflows) kind of moot, as I need on-line access. When you're dealing with 20+ TB of gen0 (raw) images, plus several copies/deltas/revisions of them being updated all the time, it's not something that is useful. I mainly use Bridge & Photoshop - mostly because I'm living in Photoshop all the time anyway - and I can't see anything in LR that I can't do better/faster in PS directly.

Now, your comment on working with large files (tens of GB each) - /THAT/ can be interesting, assuming nothing else is lost, if it's faster than Bridge. I have images in the ~10-20 GB range (large-format images); I'm just getting into these, so I don't have many in yet (still working on 127, 620, 828, 120/220, and other smaller formats), but loading the 4x5s, 8x10s, etc. takes a while for the initial indexing (once they're indexed and previews are created there's no problem - it's the single-threaded scanning that's the killer).

tived
07-11-2012, 07:36 PM
Hi Steve,
Just looking at the amount of storage in your signature makes me shiver; that is an awful lot. What on earth are you storing there - is this your own work, as in image work?

if you are looking for a very fast image file viewer then have a look at this http://www.kolor.com/xnview-software-image-processing.html

In regards to Bridge, I was of the opinion that it was a rather simple database with very few options, as opposed to Lightroom, Media Pro, or Extensis Portfolio... I am by no means an expert on databases, so forgive my ignorance. I do know that Media Pro is very good out of the box, but I also know that people who work for large libraries have different solutions due to the very large quantities of sources.

Scanning... 8x10s, and it's fast? Ouch. How about a Hasselblad X-series (old Imacon)? Still not fast, but certainly faster than prepping a drum scan ;-) What are you using for scanning? I have an older Imacon, but it's so, so slow... ahh, I see them in your signature ;-)

Online and offline access has more to do with the catalog side of it: you add metadata/keywording, which you can do offline and apply the next time you are online to the collection - a lot of people have their images/data stored on external devices as well as internal or directly connected ones. The application keeps a preview version of each image, which is portable, so you can always view the file whether you are on- or offline; that can be handy down the track. Plus, Media Pro also has an external Reader which you can supply to a client, without the client being able to make modifications to your collection.

My large images are panoramas and also layered Photoshop files with work in progress, which is why the file sizes are so big in my case; but a 6x17 scan at 3200 dpi in 16-bit is only 1 GB, so what are you scanning, and at what resolution - 8000?
thanks for sharing
Henrik

stevecs
07-12-2012, 10:38 AM
Well, the storage array is a general repository for a lot of my own (and family archival) images, film, video, etc., though the majority is my own. Then there's general storage for pretty much whatever: tech notes, docs, stuff from the past 40 years that I've found of use or reference. Indexing all of that is a major chore, similar in scope to a lot of companies' data actually. I had an issue a couple of years ago (2008/2009) and am still in the process of recovering/re-creating linkages.

I'll take a look at XnView; I'm always looking for things that can save a couple of minutes, as it really adds up when dealing with these quantities.

Bridge is rather simple in the database department, but most of what I'm using it for is, as I mentioned, EXIF/IPTC stuff and keywords - basically simple tables - as the one goal is to have something that is stored with the image/video itself and can be used by any future tool, which minimizes re-work in 10 years. Things like 'collections' and logical mappings can be done, but hard-level structures are done by filesystem layout (yes, a filesystem is a /very basic/ type of database as well ;) ). I'm now keeping data broken down by year, subject/location, and medium type (bwn; cn; positives) at the FS level, and then using the data fields for more granularity (also keeping things like film type, etc., where possible - some of my older stuff doesn't have readable edge information).

Imacons are nice, though the LS-9000s here (I just picked up another in case the first one dies) are probably the best small/medium-format scanner you can get under $10,000 that I can find. I've tried wet mounting before, but it's usually too much of a pain unless you're dealing with scratched film or trying to use an image at really insane blow-up sizes.

Yeah, Bridge is more of a file browser + database; with LR you can continue to do some review of the previews and such without having the files present. Whether that's good or bad depends on your workflow. I do know people who love that feature; in my case I couldn't really find a use for it, mainly because I have fast network access to everything and it's always available.

I try to scan images at 4000 dpi; for a 6x7 image as a 64-bit TIFF that's ~900 MB. A 6x17 would be around 2.2 GB or so. What camera do you have for the 6x17? I love the wide medium formats (6x9, 6x12) but never owned one of those cameras, just borrowed them. The problem was the lenses; they never had anything I really liked for wide-angle. I really had to use the 4x5 and crop it, which also gave me tilt/swing options, though it was a pain as it was a studio rather than a field camera. I'm keeping an eye out for a good 90mm lens with ~310mm coverage at f/22 for an 8x10 camera build - I just love the large negatives.

alpha754293
07-20-2012, 09:46 AM
See, at least for you guys, the number of data formats is pretty common/consistent throughout.

For me, when you're working with anywhere between 13 and 20 engineering programs (plus all of the regular miscellaneous "consumer" data formats/types: JPGs, AVIs, MP3s, DOCs, XLSs, PDFs, PPTs), it complicates things a little more.

That also makes searching or LOOKING for data (say... a few months or years down the line) a much more taxing task than I would really like it to be; hence I'm trying to find a more effective and efficient way of "cataloging" all of the information/data/files on the system. And as I do more and more stuff, the total number of files is going to grow, and I would like the speed and complexity of searching to grow linearly with it rather than exponentially.

tiro_uspsss
08-05-2012, 12:19 AM
I admittedly didn't read the thread thru, but here is a recent little something I found that is handy: change the location of where the index file(s) (windows) are located. I'm not sure where windows puts the index files for each HDD, but I have mine set so that all index files are stored on SSD, so when I do a search, results are fast! :) :up:

alpha754293
08-10-2012, 11:03 AM
I admittedly didn't read the thread thru, but here is a recent little something I found that is handy: change the location of where the index file(s) (windows) are located. I'm not sure where windows puts the index files for each HDD, but I have mine set so that all index files are stored on SSD, so when I do a search, results are fast! :) :up:

Once you get into the millions of files, it doesn't really seem to matter nearly as much.

I make the text file because it does help facilitate searching for files (especially since the index of the array - as far as I know - cannot be read remotely). For local systems, that's probably a true statement.

But when you're scanning a 27 TB array, I don't know if there's a way for me to tell Windows to pick up the index that's stored on a SSD on the network server to be read over the network.

tived
08-10-2012, 04:18 PM
Couldn't you hyperlink the text file's contents, to jump to the found target?

Just a guess.

I doubt that there is a viewer that can read any and all file formats, or should I say display them visually for recognition. On the other hand, if you instead break the database up by type, and then run your searches within those, you should save huge amounts of time.

Henrik

alpha754293
08-10-2012, 07:17 PM
Couldn't you hyperlink the text file's contents, to jump to the found target?

Just a guess.

I doubt that there is a viewer that can read any and all file formats, or should I say display them visually for recognition. On the other hand, if you instead break the database up by type, and then run your searches within those, you should save huge amounts of time.

Henrik

Well, part of the idea of this is that it should be as automated as possible. In my mind, I was hoping that there'd be some way that it would automatically log itself into a database so that when I need to look for a file, I can just run the search on the database rather than on the filesystem itself.

Apparently, I'm dreaming too much.

And because I am probably ALWAYS going to be able to generate files faster than I could link/log them myself, it's about trying to come up with an intelligent way of handling that much data that's very diverse in both type and size.

*edit*
I'm also going on the huge assumption that this can't possibly be a new problem, because I'm fairly certain that enterprises ran into it many, many years ago. So I was curious to see what some of the experts have to say about how they would tackle a problem such as this.

tived
08-10-2012, 07:39 PM
I don't think you are dreaming too much ;-)

How long is too long to spend searching for or locating a file? And is the file you are looking for recognisable - as in, do you know what you are looking for by filename, type, etc.? Or is the search more random, as in groups of files? Is it only you who will be searching, or will it be a group of users, some with no knowledge of the content?

I am coming from an image/graphics background, so I have visual aids to help me, but some files are just cryptic and will only be recognised by name or by their function within an application.

Do you see where I am going with this.... apart from a long string of questions ;-)

Henrik

tived
08-10-2012, 07:45 PM
When I implement image databases for photographers and designers, I use pre-existing applications such as MediaPro/iViewMedia/ExpressionMedia or Portfolio / Lightroom - all of these are image- or media-based. However, I also implement a file structure so that the user can find things manually, should the database fail!!! To me this is just as important as having a good database. 1.5 million probably isn't that many files anyway ;-)

Organise things by type, relevance, date, size, etc...

I don't code, I only design ;-) and it probably shows ;-)

Henrik

tived
08-10-2012, 07:50 PM
Also, these databases are strengthened by adding keywording to the files/images, describing them in ways other than their physical characteristics - but all this is time-consuming. There are cross-platform file databases out there, both open source and proprietary.

alpha754293
08-11-2012, 04:23 AM
I don't think you are dreaming too much ;-)

How long is too long to spend searching for or locating a file? And is the file you are looking for recognisable - as in, do you know what you are looking for by filename, type, etc.? Or is the search more random, as in groups of files? Is it only you who will be searching, or will it be a group of users, some with no knowledge of the content?

I am coming from an image/graphics background, so I have visual aids to help me, but some files are just cryptic and will only be recognised by name or by their function within an application.

Do you see where I am going with this.... apart from a long string of questions ;-)

Henrik

Well, I haven't had the "need" to perform too many searches on my new system right now because I do build the index weekly, and it's just a text file that tells me the contents of the array by filename. MOST of the time, that's enough for me to get around because then I can just scan through said text file and find/pick what I want.

Having said that, though, my current index is already 126 MB, and even looking for files JUST in the text file alone takes about 30 seconds a pop. So if I'm looking for a file where I DON'T remember the exact filename and I have to run a few searches (often remotely - i.e. I'm reading the text index file from another computer), it takes a little while. And while 30 seconds isn't much, 30 seconds here, 30 seconds there - eventually it all adds up.

And that's ALREADY faster than the default Windows searches. (And for comparison purposes, when I remotely log into the server to update the index, it takes about 4 minutes for it to build.)

I'm the only one that's searching through the system. MOST of it is relatively organized (although that's kind of going by the wayside a little bit as I go along), so say...if I want movies, I can go to the movies folder and find most of what I need there.

Other times it's more random. If I'm trying to find a particular result from a simulation that I performed, it might not necessarily be nearly as organized or well identified/marked/tagged as that.

See, that's the thing with most people, though - they come from one specific background, so the types and sizes of files are somewhat limited. For me, in engineering applications alone I deal with 13 different programs (some of it is multiple CAD systems from various designers; some of it is because software, to me, is a tool, and I pick the best tool for the job, so it's not at all uncommon for me to mix and match different apps just to get ONE task done). As a result, you end up with a whole ton of different types of files. For example, ICEM CFD geometry files are .tin. CATIA parts are .CATPart. CATIA assemblies are .CATProduct. SolidWorks parts are .sldprt. SolidWorks assemblies are .sldasm. ICEM CFD project files are .prj. Ansys Workbench project files are .wbpj. CFX meshes are either .gtm or .def. Fluent meshes are .msh. And so on.

With pictures, there's really only a limited number of formats that programs are likely to generate. And since there are only a few major players in photo and image-editing software, the types of files and the sizes of the files are usually relatively well defined (for a given size and resolution of picture, for example).

For me, some result files are only on the order of tens of megabytes apiece. In other cases, the temporary scratch file that a program produces is over 10 GB. And I have pretty much everything in between. Coming from a mixed background like that changes things: standard rule sets almost no longer apply, because there is no "one-size-fits-all" solution.

Can I go "find all *.trn (CFX transient result files)"? Yes, sure. But for one of my simulations there are, I think, over 3000 of them JUST for that run alone. (Run it a few times, and you can quickly see how that won't really be useful when you're presented with 15,000+ answers, grouped "somewhat" by the fact that they came from that particular simulation run.)
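For that specific case, chaining two findstr passes over the plain-text index at least narrows things down without any new tooling - "waterpump_run03" below is a made-up stand-in for whatever the run directory happens to be called:

rem All .trn entries for one particular run, pulled out of the master index:
findstr /i /c:".trn" share.txt | findstr /i /c:"waterpump_run03" > run03_trn.txt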

Like I said, I can't imagine this being a new problem at all; and I would think that companies face/deal with this all the time, so there HAS to be a solution that's probably already available out there somewhere.


When I implement image databases for photographers and designers, I use pre-existing applications such as MediaPro/iViewMedia/ExpressionMedia or Portfolio / Lightroom - all of these are image- or media-based. However, I also implement a file structure so that the user can find things manually, should the database fail!!! To me this is just as important as having a good database. 1.5 million probably isn't that many files anyway ;-)

Organise things by type, relevance, date, size, etc...

I don't code, I only design ;-) and it probably shows ;-)

Henrik

1.5 million is only the current state of the array - I've only used a third of it. Which means that, at the current rate, the total number of files on the array could very easily top 4.5-5.0 million.

If at 1.5 million files my index ALONE is already 126 MB, then by the time I'm done the index alone would be sitting at nearly 375 MB, and running a search through that will take AT LEAST 1.5 minutes each, assuming linear scalability (which probably isn't true; more likely some kind of low-power/exponential curve).


Also, these databases are strengthened by adding keywording to the files/images, describing them in ways other than their physical characteristics - but all this is time-consuming. There are cross-platform file databases out there, both open source and proprietary.

You try keywording 3000 files that come from a simulation that runs for 7 days. ;o) Can you imagine having to sit there and manually tag/keyword/add METADATA for each of those files in order to help/facilitate the search? ;o)

(3000 files in 7 days isn't that bad. But what is bad is that you'll have to stay awake for the 7 days, manually tagging the files. ;o))

So the idea/intent would be that the database automatically detects that there's a new file on the array, TRIES to identify what type of file it is, and, if there are similar files being generated at about the same time or of roughly the same "type", automatically tags them for me and "registers"/"acknowledges their presence" in the database.

The hard part is when I'm trying different settings in a simulation (for example, say instead of simulating air I'm simulating water): the result file is going to look pretty much identical as far as the file system/array/database is concerned. But obviously the contents would be DRAMATICALLY different (you get very different results if you run with water something that's supposed to be done with air). And even then, at least it would be easy to tell from the results themselves. But suppose, in another case, I'm simulating a car engine and instead of testing "87 pump gas" I'm simulating "91 octane premium gas"; the difference in the simulation specification is actually very, very subtle. (Change two numbers, re-run.)

THAT difference might be very minute, especially in the very very beginning. But I would still somehow want the database to be able to recognize that difference.

MasterOfTheReal
08-11-2012, 04:50 AM
My 2c.
I think at the 1-million-file mark you're better off starting to look at Digital Asset Management solutions such as Documentum by EMC, or some of the open-source alternatives (http://www.opensourcedigitalassetmanagement.org/reviews/available-open-source-dam/), as you're not likely to find a "desktop"-style solution at that kind of file volume. Media clearing houses and post-production studios regularly deal with similar volumes, and I haven't encountered any that ended up being happy with a desktop-based solution.

desnudopenguino
08-11-2012, 07:26 AM
If you're looking for something to fit the scope of multiple different formats that aren't all supported by one or two simple applications, you could:

1. look into an enterprise level data management application (probably really expensive)
2. have someone/yourself build you a custom application (less expensive but may take a while)
3. use good file management methods and build a better, forward moving, file structure

You could use something from the Adobe suite to catalog part of your dataset, but I'm sure it won't get all of it, and the MySQL back-end (I assume that's what it is, from the discussion here so far) will crap out eventually because it's most likely not tuned for huge sets of data. To have something like this happen automagically, you will probably have to invest quite a bit.

I could build a database that handles the metadata (keywords, file type, date created/edited, etc.) and points to your files, but that metadata would have to be entered manually. You could, though, group files and enter the same metadata for a whole group at once, so it wouldn't take as long; then, for the more notable files, you could add something individually.
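For illustration, the kind of table being described might look like the sketch below - the database, table, and column names are invented, and the group-tagging UPDATE assumes the run name appears somewhere in the path:

rem Hypothetical metadata table pointing at the files:
sqlcmd -S localhost -d FileIndex -Q "CREATE TABLE FileMeta (FilePath nvarchar(2000), Extension nvarchar(32), SizeBytes bigint, Modified datetime, Keywords nvarchar(400))"

rem Tag a whole group of files in one shot instead of one at a time
rem (%% is the batch-file escape for a literal %):
sqlcmd -S localhost -d FileIndex -Q "UPDATE FileMeta SET Keywords = 'CFX, waterpump, run03' WHERE FilePath LIKE '%%run03%%'"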

The best thing I would suggest though would be to build a file structure that breaks these files down into meaningful sub-groups, and maintain it, and stick to it. At this point, there is probably either going to be a lot of time invested into this project, or a lot of money.

Andreas
09-04-2012, 09:41 AM
Anybody know of a good way to manage 1.5 million+ files?
They all vary in size and distribution and type and format. There's no standard naming convention.
Hopefully there'd be a way to automatically index what's on the array, register the file into a database, and make searching for a file a lot faster.
Thoughts/suggestions?
1.5 million files isn't that much for a modern filesystem these days (unless you have all the files in one folder ;) )

Two easy ways to reduce storage needs and speed up access:

1) Access: you might take a look at "Everything" (http://www.voidtools.com/download.php). This little tool indexes only the directory structure of your Windows server/client and is blazingly fast. It has been my favorite tool for a couple of million files for a few years now. Highly recommended.

2) Storage reduction: it is not uncommon for 1.5 million files to contain many duplicates, soaking up storage space (and backup space). As a one-time effort you could look into deduplication tools to delete all the duplicate files. BTW, Windows Server 2012 has deduplication built in (http://technet.microsoft.com/en-us/library/hh831700.aspx). Be aware that it is cluster-based, not file-based, and depending on the file types stored it gives back 30-70% of your consumed space. If interested, a little tool is available to evaluate what your savings could be: the DeDupEval Tool (http://blogs.technet.com/b/klince/archive/2012/08/09/evaluate-savings-with-the-deduplication-evaluation-tool-ddpeval-exe.aspx).

Depending on your disk subsystem, expect a couple of hours of runtime, as it needs to read all files end-to-end for proper analysis.
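Running the evaluation itself is a one-liner; per the blog post above, DDPEval.exe can be copied off a Server 2012 installation (it sits under \Windows\System32) and pointed at a volume or folder - the path below is just an example:

rem Analyse a volume or folder for potential dedup savings:
DDPEval.exe D:\share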

Andy

alpha754293
09-07-2012, 03:58 PM
Thanks! I'll have to take a look at that.

*edit*

I just downloaded the Windows Server 2012 RC and am giving it a shot. I was also reading from the link you sent that it processes about 100 GB/hour, and right now I've used about 9.7 TiB, which means it'll need close to 100 hours to go through the entire array of data that I've already got.

But it seems like it's a nifty little tool.

However, sometimes (quite often) I'm not necessarily duplication-bound, because - for example - when I am running multiple simulations where I'm trying stuff out, it's actually quite easy for me to exceed the path-length limitations of most FSes.

And sometimes, the changes can be very subtle.

But we'll have to see how that goes. Thanks for the info though. Never knew about it.

*edit*
9 hours 46 minutes later, and it's only processed 646 GB out of 9.7 TB (about 6-7%). This is definitely MUCH slower than the quoted 100 GB/hour. (The array itself is able to read/write at some 115 MB/s, but the poor little old 2.8 GHz Xeon might be struggling to keep up with it.)

jimba86
09-07-2012, 04:11 PM
Put it on the biggest hard drive you can find and forget about it? Free space on the old disk, and you can still access it when you need to... :P

alpha754293
09-07-2012, 05:29 PM
Put it on the biggest hard drive you can find and forget about it? Free space on the old disk, and you can still access it when you need to... :P

Well... unfortunately, a) I don't have access to prototype 4 TB and 6 TB drives, and b) I have a friend who does, and he has a SERIOUS issue with them right now (every time the head sweeps over, it flips the polarity - bad. Apparently, he also said that it's a firmware issue. *shrug*)

So, other than that, I'm already using 3 TB drives in the array.

But it isn't SPACE that's the issue; it's "how fast can I get the system to locate/find/retrieve something for me?" (And it's not I/O either.) If you've ever had to manually sort through more than a million files, you'll learn very quickly what I am talking about. Just building the PLAINTEXT index of all the files on the array takes just under 3.5 minutes, the resulting file is already 131 MB (!), and looking for a file in that is already considerably faster than actually searching the array itself.

alpha754293
09-08-2012, 06:05 PM
Currently running DDPEval; it's processing about 50 GB/hour, which means it'll take an estimated 194 hours to finish going through the 9.7 TB of data already on the array.

tiro_uspsss
09-13-2012, 01:22 AM
1.5 million files isn't that much for a modern filesystem these days (unless you have all the files in one folder ;) )

Two easy ways to reduce storage needs and speed up access:

1) Access: you might take a look at "Everything" (http://www.voidtools.com/download.php). This little tool indexes only the directory structure of your Windows server/client and is blazingly fast. It has been my favorite tool for a couple of million files for a few years now. Highly recommended.

2) Storage reduction: it is not uncommon for 1.5 million files to contain many duplicates, soaking up storage space (and backup space). As a one-time effort you could look into deduplication tools to delete all the duplicate files. BTW, Windows Server 2012 has deduplication built in (http://technet.microsoft.com/en-us/library/hh831700.aspx). Be aware that it is cluster-based, not file-based, and depending on the file types stored it gives back 30-70% of your consumed space. If interested, a little tool is available to evaluate what your savings could be: the DeDupEval Tool (http://blogs.technet.com/b/klince/archive/2012/08/09/evaluate-savings-with-the-deduplication-evaluation-tool-ddpeval-exe.aspx).

Depending on your disk subsystem, expect a couple of hours of runtime, as it needs to read all files end-to-end for proper analysis.

Andy

where is the download link to that deduptool program? :confused:

alpha754293
09-14-2012, 06:36 PM
where is the download link to that deduptool program? :confused:

There isn't one.

tived
09-14-2012, 07:35 PM
I think you have to sign up and download WinSrv2012 to be able to play with this.

BTW, when will it be released?

Henrik

tiro_uspsss
09-15-2012, 02:56 AM
I think you have to sign up and download WinSrv2012 to be able to play with this.

BTW, when will it be released?

Henrik

oh I see, it's part of WS12 :(

WS12 has already been released! :up:

tived
09-15-2012, 06:26 PM
Between us we make a great team ;-)
thanks


oh I see, it's part of WS12 :(

WS12 has already been released! :up:

alpha754293
09-20-2012, 02:50 AM
It took nearly a week to scan my 9.7 TB of data; it said I would save about 40% with dedup. I'm waiting to back up all the data before going ahead with it.

tiro_uspsss
10-10-2012, 03:50 AM
1.5 million files isn't that much for a modern filesystem these days (unless you have all the files in one folder ;) )

Two easy ways to reduce storage needs and speed up access:

1) Access: you might take a look at "Everything" (http://www.voidtools.com/download.php). This little tool indexes only the directory structure of your Windows server/client and is blazingly fast. It has been my favorite tool for a couple of million files for a few years now. Highly recommended.

2) Storage reduction: it is not uncommon for 1.5 million files to contain many duplicates, soaking up storage space (and backup space). As a one-time effort you could look into deduplication tools to delete all the duplicate files. BTW, Windows Server 2012 has deduplication built in (http://technet.microsoft.com/en-us/library/hh831700.aspx). Be aware that it is cluster-based, not file-based, and depending on the file types stored it gives back 30-70% of your consumed space. If interested, a little tool is available to evaluate what your savings could be: the DeDupEval Tool (http://blogs.technet.com/b/klince/archive/2012/08/09/evaluate-savings-with-the-deduplication-evaluation-tool-ddpeval-exe.aspx).

Depending on your disk subsystem, expect a couple of hours of runtime, as it needs to read all files end-to-end for proper analysis.

Andy

that deduptool... can it scan/run/do HDDs that are network shares? :)

alpha754293
10-10-2012, 06:45 PM
that deduptool... can it scan/run/do HDDs that are network shares? :)

It probably can if you have the share mapped as a network drive. I wouldn't recommend it, though, because it has to read everything block by block; so unless you're using some kind of high-speed interconnect like InfiniBand or 10-gigabit Ethernet, it's probably not worth it.

Scanning the data on my 27 TB RAID 5 array took nearly a week on the local system. You can do the math to figure out what your best possible speed would be if you tried to do the same scan over the network (unless you really have to).

zeroibis
10-27-2012, 12:53 AM
I have 1.3 million+ files on my array. I just keep everything organized in folders, and then folders within folders. I have never had any major issues finding what I am looking for. You may want to check out Lammer Context Menu if you want to database your files, as it can make it easy to batch-rename stuff. I used it to clean up my video library for XBMC before creating a naming standard that I use for all new file names.

alpha754293
11-05-2012, 06:57 PM
I have 1.3 million+ files on my array. I just keep everything organized in folders, and then folders within folders. I have never had any major issues finding what I am looking for. You may want to check out Lammer Context Menu if you want to database your files, as it can make it easy to batch-rename stuff. I used it to clean up my video library for XBMC before creating a naming standard that I use for all new file names.

What's XBMC?

Well, the problem with creating folders in folders in folders (etc.) is that I have hit the path-length limit before (it doesn't take much for me to run into it, actually), so that's why it doesn't really work all that well for me. (And I've hit it even on ZFS on Solaris.) So... yeah...

zeroibis
11-06-2012, 12:13 AM
XBMC is an open-source media system for a lot of different OSes. It is mainly used on HTPCs to serve media libraries.

The thing I was referring to is a different program called Lammer Context Menu, which could be useful if you want to apply things like batch-renaming operations to a lot of files to aid in organization.

Is there any reason why you need to have such a deep folder structure? Or are you trying to create a folder maze to hide :banana::banana::banana::banana:... lol

tiro_uspsss
11-10-2012, 08:59 PM
OK, I have a rig up & running with WS2012... I can't seem to get the dedup tool up & running. Any help? :shrug:

alpha754293
11-11-2012, 06:07 AM
OK, I have a rig up & running with WS2012... I can't seem to get the dedup tool up & running. Any help? :shrug:

Follow this procedure for getting data deduplication up and running...

http://technet.microsoft.com/en-us/library/hh831700.aspx
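For anyone landing here later, the gist of that procedure is roughly the sketch below - the exact DISM feature name may differ by build, so confirm it with "dism /online /get-features", or just use Server Manager or the PowerShell cmdlets described at the link:

rem Install the Data Deduplication role service (feature name is from memory):
dism /online /enable-feature /featurename:Dedup-Core /all

rem The volume is then enabled and scheduled via the dedup PowerShell cmdlets
rem (Enable-DedupVolume, Start-DedupJob) or the File and Storage Services UI,
rem as described in the TechNet article above.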