
Thread: Anybody know a good way to manage 1.5 million+ files?

  1. #26
    Xtreme Member | Join Date: Aug 2009 | Location: Somewhere | Posts: 220
    If you're looking for something that fits the scope of multiple different formats, not all of which are supported by one or two simple applications, you could:

    1. look into an enterprise-level data management application (probably really expensive)
    2. have someone (or yourself) build you a custom application (less expensive, but it may take a while)
    3. use good file management methods and build a better, forward-moving file structure

    You could use something from the Adobe suite to catalog part of your dataset, but I'm sure it won't get all of it, and the MySQL back-end (I would assume that's what it is from the discussion here so far) will crap out eventually because it's most likely not tuned for huge data sets. To have something like this happen automagically, you will probably have to invest quite a bit.

    I could build a database that handles the metadata (keywords, file type, date created, date edited, etc.) and points to your files, but the metadata would have to be entered manually. You could, however, group files and enter the same metadata for a whole group at once, so it wouldn't take as long; then, for the more notable files, you could add something to those individually.
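
    For what it's worth, a minimal sketch of that kind of metadata database, assuming SQLite as the back-end (the table and column names are just placeholders):

    import sqlite3

    # Hypothetical schema: one row of metadata per file, pointing back to the real path on the array.
    conn = sqlite3.connect("catalog.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS files (
            path     TEXT PRIMARY KEY,  -- full path on the array
            filetype TEXT,              -- extension or format name
            created  TEXT,              -- timestamps stored as ISO-8601 strings
            edited   TEXT,
            keywords TEXT               -- entered manually, per file or per group
        )
    """)

    def tag_group(paths, filetype, keywords):
        # Apply the same metadata to a whole group of files in one pass.
        conn.executemany(
            "INSERT OR REPLACE INTO files (path, filetype, keywords) VALUES (?, ?, ?)",
            [(p, filetype, keywords) for p in paths],
        )
        conn.commit()

    # Searching then becomes an indexed query instead of walking the array:
    # conn.execute("SELECT path FROM files WHERE keywords LIKE ?", ("%turbulence%",)).fetchall()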

    The best thing I can suggest, though, is to build a file structure that breaks these files down into meaningful sub-groups, then maintain it and stick to it. At this point, this project is probably going to take either a lot of time or a lot of money.

    Desktop (and Cruncher #1):AMD Phenom II x6 1090T @ 4.03Ghz | Gigabyte MA790FXT-UD5P (F8n) | G.Skill Ripjaws 2x4GB @ 9-9-9-24-1T 1680MHz | Radeon HD 5850 & 5830 | Silverstone ST75F 750W | 60GB OCZ Vertex 2 3x1TB WD RE3 (Raid 5) | Lian Li PC-A70B
    Cruncher (#2): Intel Core I7 920 (stock) | EVGA X58 SLI | G.Skill Pi 3x2GB | 2x Radeon HD 6870 | Corsair HX850 | Some Janky HDD | LanCool PC-K7
    Cruncher (#3): Intel Core I7 2600k (stock) | BioStar TH67+ | G.Skill Ripjaws 2x4GB | Antec Basiq550 | Some Janky HDD | Antec 300
    Server: Intel Atom | 2x2GB DDR3 | ThermalRight TR2-430 | Some Less Janky Laptop HDD | Fractal Core-1000
    Mobile: Lenovo X120e

  2. #27
    Registered User | Join Date: Aug 2012 | Posts: 70
    Quote Originally Posted by alpha754293 View Post
    Anybody know of a good way to manage 1.5 million+ files?
    They all vary in size and distribution and type and format. There's no standard naming convention.
    Hopefully there'd be a way to automatically index what's on the array, register the file into a database, and make searching for a file a lot faster.
    Thoughts/suggestions?
    1.5 million files isn't that much for a modern filesystem these days (unless you have all of them in one folder).

    Two easy ways to reduce storage needs and speed up access:

    1) Access: you might take a look at "Everything". This little tool indexes only the directory structure of your Windows server/client and is blazingly fast. It has been my favorite tool for a couple of million files for a few years now. Highly recommended.

    2) Storage reduction: it's not uncommon for 1.5 million files to contain many duplicates, soaking up storage space (and backup space). As a one-time effort, you could look into deduplication tools to delete all the duplicate files. BTW, Windows Server 2012 has deduplication built in. Be aware that it is cluster-based rather than file-based and, depending on the file types stored, gives back 30-70% of your consumed space. If you're interested, a little tool is available to evaluate what the savings with your files could be: the DeDupEval Tool.
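
    If the built-in options don't fit, this is roughly what a file-level (as opposed to cluster-based) duplicate scan boils down to; a minimal Python sketch for illustration, not the DeDupEval Tool itself:

    import hashlib
    import os
    from collections import defaultdict

    def find_duplicates(root):
        # Group files by content hash; any group with more than one path is a set of duplicates.
        by_hash = defaultdict(list)
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                digest = hashlib.sha256()
                try:
                    with open(path, "rb") as f:
                        for chunk in iter(lambda: f.read(1 << 20), b""):  # read end-to-end, 1 MiB at a time
                            digest.update(chunk)
                except OSError:
                    continue  # skip files that can't be read
                by_hash[digest.hexdigest()].append(path)
        return {h: paths for h, paths in by_hash.items() if len(paths) > 1}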

    Depending on your disk subsystem, expect a couple of hours of runtime, as it needs to read all files end-to-end for proper analysis.

    Andy

  3. #28
    Xtreme Member | Join Date: Apr 2006 | Location: Ontario | Posts: 349
    Thanks! I'll have to take a look at that.

    *edit*

    I just downloaded the Windows Server 2012 RC and am giving it a shot. I was also reading from the link that you sent that it processes about 100 GB/hour, and right now I've used about 9.7 TiB, which means it'll need at least 97 hours or so to go through the entire array of data that I've got already.

    But it seems like it's a nifty little tool.

    However, quite often I'm not really duplication-bound, because - for example - when I'm running multiple simulations where I'm trying things out, it's actually quite easy for me to exceed the path-length limitations of most filesystems.

    And sometimes, the changes can be very subtle.

    But we'll have to see how that goes. Thanks for the info though. Never knew about it.

    *edit*
    9 hours 46 minutes later, and it's only processed 646 GB out of 9.7 TB (about 6.5% or so). This is definitely MUCH slower than the 100 GB/hour. (The array itself is able to read/write at some 115 MB/s, but the poor little old 2.8 GHz Xeon might be struggling to keep up with it.)
    Last edited by alpha754293; 09-08-2012 at 03:21 AM.
    flow man:
    du/dt + u dot del u = - del P / rho + v vector_Laplacian u
    {\partial\mathbf{u}\over\partial t}+\mathbf{u}\cdot\nabla\mathbf{u} = -{\nabla P\over\rho} + \nu\nabla^2\mathbf{u}

  4. #29
    Xtreme Member | Join Date: Mar 2012 | Location: Brisbane, Australia | Posts: 182
    Put it on the biggest hard drive you can find and forget about it? Free space on the old disk, and you can still access it when you need to.. :P
    Work/Gaming: i7-950|GB X58A-UD7|12GB Trident BBSE/XMS3|460GTX|WD 1TB BLK|Pioneer DVDRW|CM HAF-X|Win 7 Pro 64 bit|U2711|HX850|G500|G510
    Quote Originally Posted by hiwa View Post
    I protect my gskills like how i protect my balls
    Heatware: jimba86
    Bench:Custom Giga-bench|Win 7/XP SP3|WD 36GB Raptor|Dell 22|AX1200|MS intellimouse
    Bench 1:i7 930|water 2.0 performer|Gigabyte X58A-OC | 4GB corsair 1866 CL7|GTX 295 Quad SLI
    Bench 2:E8500|NHD14|P45-UD3P(2nd PCIEx16 slot broken.. )|2GB Corsair 8888 Cl4|GTX 260 SOC
    (Bench 3: In Progress) 4770K|F1EE|Gigabyte Z97X-SOC Force| 4GB GTX1 /8GB Gskill TridentX 2666CL11 ney pro|5870 x3 on KPC tek9 slim 5.0/7.0
    Bench 4 E8600|NHD14|REX|2GB Corsair 1800 Cl7|Asus GTX 280
    Bench 5 (TBC): FX?|990FX-UD7|Gskill Flare 2000 Cl7|6970
    Server/renderbox: G3258|TT Water 2.0|Gigabyte Z87 Sniper M5|8GB Gskill Sniper 2133|gigabyte 5750|FD NODE 804

  5. #30
    Xtreme Member | Join Date: Apr 2006 | Location: Ontario | Posts: 349
    Quote Originally Posted by jimba86 View Post
    Put it on the biggest hard drive you can find and forget about it? Free space on the old disk, and you can still access it when you need to.. :P
    Well... unfortunately, a) I don't have access to prototype 4 TB and 6 TB drives, and b) I have a friend who does, and he's having a SERIOUS issue with them right now (every time the head sweeps over, it flips the polarity. Bad. Apparently, he also said that's a firmware issue. *shrug*).

    So, other than that, I'm already using 3 TB drives in the array.

    But it isn't SPACE that's the issue. It's "how fast can I get the system to locate/find/retrieve something for me?" (And it's not I/O either.) If you've ever had to manually sort through > 1 million files, you'll know/learn very quickly what I'm talking about. Consider that just building the PLAINTEXT index of all of the files on the array takes just under 3.5 minutes, the resulting file is already 131 MB (!), and looking for a file in that is already considerably faster than actually searching the array itself.
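
    For reference, that kind of plain-text index is just a directory walk dumped into a single file; a rough Python equivalent, with the mount point and search term as placeholders:

    import os

    # Dump every file path on the array into one flat text file.
    with open("array_index.txt", "w", encoding="utf-8") as out:
        for dirpath, _dirs, files in os.walk(r"D:\array"):
            for name in files:
                out.write(os.path.join(dirpath, name) + "\n")

    # Searching the index is then a linear scan of one file instead of hitting the array:
    with open("array_index.txt", "r", encoding="utf-8") as idx:
        hits = [line.rstrip("\n") for line in idx if "run_042" in line]
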
    flow man:
    du/dt + u dot del u = - del P / rho + v vector_Laplacian u
    {\partial\mathbf{u}\over\partial t}+\mathbf{u}\cdot\nabla\mathbf{u} = -{\nabla P\over\rho} + \nu\nabla^2\mathbf{u}

  6. #31
    Xtreme Member | Join Date: Apr 2006 | Location: Ontario | Posts: 349
    Currently running DeDupEval at about 50 GB/hour, which means it'll take an estimated 194 hours to finish going through the 9.7 TB of data already on the array.
    flow man:
    du/dt + u dot del u = - del P / rho + v vector_Laplacian u
    {\partial\mathbf{u}\over\partial t}+\mathbf{u}\cdot\nabla\mathbf{u} = -{\nabla P\over\rho} + \nu\nabla^2\mathbf{u}

  7. #32
    I am Xtreme | Join Date: Jan 2006 | Location: Australia! :) | Posts: 6,096
    Quote Originally Posted by Andreas View Post
    1.5 million files isn't that much for a modern filesystem these days (unless you have all of them in one folder).

    Two easy ways to reduce storage needs and speed up access:

    1) Access: you might take a look at "Everything". This little tool indexes only the directory structure of your Windows server/client and is blazingly fast. It has been my favorite tool for a couple of million files for a few years now. Highly recommended.

    2) Storage reduction: it's not uncommon for 1.5 million files to contain many duplicates, soaking up storage space (and backup space). As a one-time effort, you could look into deduplication tools to delete all the duplicate files. BTW, Windows Server 2012 has deduplication built in. Be aware that it is cluster-based rather than file-based and, depending on the file types stored, gives back 30-70% of your consumed space. If you're interested, a little tool is available to evaluate what the savings with your files could be: the DeDupEval Tool.

    Depending on your disk subsystem, expect a couple of hours of runtime, as it needs to read all files end-to-end for proper analysis.

    Andy
    Where is the download link for that dedup tool?
    DNA = Design Not Accident
    DNA = Darwin Not Accurate

    heatware / ebay
    HARDWARE I only own Xeons, Extreme Editions & Lian Li's
    https://prism-break.org/

  8. #33
    Xtreme Member | Join Date: Apr 2006 | Location: Ontario | Posts: 349
    Quote Originally Posted by tiro_uspsss View Post
    Where is the download link for that dedup tool?
    There isn't one.
    flow man:
    du/dt + u dot del u = - del P / rho + v vector_Laplacian u
    {\partial\mathbf{u}\over\partial t}+\mathbf{u}\cdot\nabla\mathbf{u} = -{\nabla P\over\rho} + \nu\nabla^2\mathbf{u}

  9. #34
    Xtreme Member | Join Date: Aug 2010 | Location: perth, west oz | Posts: 252
    I think you have to sign up and download WinSrv2012 to be able to play with this.

    BTW, when will it be released?

    Henrik

  10. #35
    I am Xtreme | Join Date: Jan 2006 | Location: Australia! :) | Posts: 6,096
    Quote Originally Posted by tived View Post
    I think you have to sign up and download WinSrv2012 to be able to play with this.

    BTW, when will it be released?

    Henrik
    oh I see, it's part of WS12

    WS12 has already been released!
    DNA = Design Not Accident
    DNA = Darwin Not Accurate

    heatware / ebay
    HARDWARE I only own Xeons, Extreme Editions & Lian Li's
    https://prism-break.org/

  11. #36
    Xtreme Member | Join Date: Aug 2010 | Location: perth, west oz | Posts: 252
    Between us we make a great team ;-)
    thanks

    Quote Originally Posted by tiro_uspsss View Post
    oh I see, it's part of WS12

    WS12 has already been released!
    Henrik
    A Dane Down Under

    Current systems:
    EVGA Classified SR-2
    Lian Li PC-V2120 Black, Antec 1200 PSU,
    2x X5650 (20x 190 APPROX 4.2GHZ), CPU Cooling: Noctua NH-D14
    (48gb) 6x 8Gb Kingston ECC 1333 KVR1333D3D4R9S/8GI, Boot: 8R0 SAMSUNG 830 129GB ARECA 1882IX-4GB CACHE - Scratch disk: 2x6R0 INTEL 520 120GB's, 2x IBM M1015/LSI 9240-8i, Asus GTX-580

    ASUS P5W64 WS PRO, QX-6700 (Extreme Quadcore) 2.66Ghz, 4x2GB HyberX, various hard drives and GT-7600

    Tyan S2895 K8WE 2x 285 Opteron's 8x 2gb DDR400 1x nVidia GT-8800 2x 1 TB Samsung F1 (not very nice) Chenbro SR-107 case

    Monitors: NEC 2690v2 & Dell 2405 & 2x ASUS VE246H

  12. #37
    Xtreme Member | Join Date: Apr 2006 | Location: Ontario | Posts: 349
    It took nearly a week to scan my 9.7 TB of data; it said I would save about 40% with dedup. I'm waiting to back up all the data before going ahead with it.
    flow man:
    du/dt + u dot del u = - del P / rho + v vector_Laplacian u
    {\partial\mathbf{u}\over\partial t}+\mathbf{u}\cdot\nabla\mathbf{u} = -{\nabla P\over\rho} + \nu\nabla^2\mathbf{u}

  13. #38
    I am Xtreme | Join Date: Jan 2006 | Location: Australia! :) | Posts: 6,096
    Quote Originally Posted by Andreas View Post
    1.5 million files isn't that much for a modern filesystem these days (unless you have all of them in one folder).

    Two easy ways to reduce storage needs and speed up access:

    1) Access: you might take a look at "Everything". This little tool indexes only the directory structure of your Windows server/client and is blazingly fast. It has been my favorite tool for a couple of million files for a few years now. Highly recommended.

    2) Storage reduction: it's not uncommon for 1.5 million files to contain many duplicates, soaking up storage space (and backup space). As a one-time effort, you could look into deduplication tools to delete all the duplicate files. BTW, Windows Server 2012 has deduplication built in. Be aware that it is cluster-based rather than file-based and, depending on the file types stored, gives back 30-70% of your consumed space. If you're interested, a little tool is available to evaluate what the savings with your files could be: the DeDupEval Tool.

    Depending on your disk subsystem, expect a couple of hours of runtime, as it needs to read all files end-to-end for proper analysis.

    Andy
    That dedup tool... can it scan/run on HDDs that are network shares?
    DNA = Design Not Accident
    DNA = Darwin Not Accurate

    heatware / ebay
    HARDWARE I only own Xeons, Extreme Editions & Lian Li's
    https://prism-break.org/

  14. #39
    Xtreme Member | Join Date: Apr 2006 | Location: Ontario | Posts: 349
    Quote Originally Posted by tiro_uspsss View Post
    That dedup tool... can it scan/run on HDDs that are network shares?
    It probably can if you have it mapped as a network drive. I wouldn't recommend it, though, because it has to read everything block-by-block, so unless you're using some kind of high-speed interconnect like InfiniBand or 10 Gbps Ethernet, it's probably not worth it.

    Scanning the 27 TB RAID5 array took nearly a week on the local system. You can do the math to figure out what your best possible speed would be if you tried to do the same scan over the network (unless you have no other choice).
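
    A back-of-the-envelope version of that math, with assumed link speeds just to show where the bottleneck sits (real scans run far slower because of per-file overhead):

    # Best-case scan time = data to read divided by the slowest link in the chain.
    data_gb        = 9700.0   # data actually on the array, in GB
    local_disk_mbs = 115.0    # local sequential read, MB/s
    gige_mbs       = 110.0    # roughly what 1 Gbps Ethernet manages in practice, MB/s

    for label, speed in [("local", local_disk_mbs), ("over GbE", min(local_disk_mbs, gige_mbs))]:
        hours = data_gb * 1000.0 / speed / 3600.0
        print(f"{label}: ~{hours:.0f} h at {speed:.0f} MB/s, best case")
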
    flow man:
    du/dt + u dot del u = - del P / rho + v vector_Laplacian u
    {\partial\mathbf{u}\over\partial t}+\mathbf{u}\cdot\nabla\mathbf{u} = -{\nabla P\over\rho} + \nu\nabla^2\mathbf{u}

  15. #40
    Xtreme Enthusiast | Join Date: Dec 2008 | Posts: 522
    I have 1.3 million+ files on my array. I just keep everything organized in folders, and then folders within folders. I have never had any major issues finding what I am looking for. You may want to check out Lammer Context Menu if you want to database your files, as it makes it easy to batch-rename stuff. I used it to clean out my video library for XBMC before creating a naming standard that I now use for all new file names.
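
    If you'd rather script it than use a shell extension, a naming-standard pass is only a few lines of Python; the normalization rule below is purely an example, not XBMC's or Lammer Context Menu's convention:

    import os
    import re

    def normalize(name):
        # Example convention: lowercase, with spaces and odd characters collapsed to underscores.
        base, ext = os.path.splitext(name)
        base = re.sub(r"[^\w\-]+", "_", base.strip().lower()).strip("_")
        return base + ext.lower()

    def batch_rename(folder, dry_run=True):
        # Preview the renames first; only touch the files once dry_run is set to False.
        for name in os.listdir(folder):
            new = normalize(name)
            if new != name:
                print(f"{name} -> {new}")
                if not dry_run:
                    os.rename(os.path.join(folder, name), os.path.join(folder, new))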

  16. #41
    Xtreme Member | Join Date: Apr 2006 | Location: Ontario | Posts: 349
    Quote Originally Posted by zeroibis View Post
    I have 1.3 million+ files on my array. I just keep everything organized in folders, and then folders within folders. I have never had any major issues finding what I am looking for. You may want to check out Lammer Context Menu if you want to database your files, as it makes it easy to batch-rename stuff. I used it to clean out my video library for XBMC before creating a naming standard that I now use for all new file names.
    What's XBMC?

    Well, the problem with me creating folders within folders within folders (etc.) is that I have hit the path-length limit before. (It doesn't take much for me to run into it, actually.) So... that's why that doesn't really work all that well for me. (And I've hit it even on ZFS on Solaris.) So... yeah...
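
    For what it's worth, a quick way to check how close an existing tree gets to a given path-length limit (260 characters is the classic Windows MAX_PATH; the root path here is a placeholder):

    import os

    MAX_PATH = 260  # classic Windows limit; adjust for the filesystem in question

    def long_paths(root, limit=MAX_PATH):
        # Yield (length, path) for every file whose full path exceeds the limit.
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                if len(path) > limit:
                    yield len(path), path

    for length, path in long_paths(r"D:\array"):
        print(length, path)
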
    flow man:
    du/dt + u dot del u = - del P / rho + v vector_Laplacian u
    {\partial\mathbf{u}\over\partial t}+\mathbf{u}\cdot\nabla\mathbf{u} = -{\nabla P\over\rho} + \nu\nabla^2\mathbf{u}

  17. #42
    Xtreme Enthusiast | Join Date: Dec 2008 | Posts: 522
    XBMC is an open-source media system for a lot of different OSes. It is mainly used for HTPCs to serve media libraries.

    The thing I was referring to is a different program called Lammer Context Menu, which can be useful if you want to apply things like batch-renaming operations to a lot of files to aid in organization.

    Is there any reason why you need to have such a deep folder structure? Or are you trying to create a folder maze to hide :banana::banana::banana::banana:... lol

  18. #43
    I am Xtreme | Join Date: Jan 2006 | Location: Australia! :) | Posts: 6,096
    OK, I have a rig up & running with WS2012... I can't seem to get the dedup tool up & running. Any help?
    DNA = Design Not Accident
    DNA = Darwin Not Accurate

    heatware / ebay
    HARDWARE I only own Xeons, Extreme Editions & Lian Li's
    https://prism-break.org/

  19. #44
    Xtreme Member | Join Date: Apr 2006 | Location: Ontario | Posts: 349
    Quote Originally Posted by tiro_uspsss View Post
    OK, I have a rig up & running with WS2012... I can't seem to get the dedup tool up & running. Any help?
    Follow this procedure for getting data deduplication up and running...

    http://technet.microsoft.com/en-us/l.../hh831700.aspx
    flow man:
    du/dt + u dot del u = - del P / rho + v vector_Laplacian u
    {\partial\mathbf{u}\over\partial t}+\mathbf{u}\cdot\nabla\mathbf{u} = -{\nabla P\over\rho} + \nu\nabla^2\mathbf{u}
