Need help with tech storage issues

Discussion in 'Miscellaneous' started by ElfinPineapple, Feb 11, 2018.

  1. Some of you might know, whether from recent statuses or conversations you've had with me, that I'm a big politics junkie and am looking into getting a PhD in the field. As part of the studies I've been interested in I've been downloading a ton of legislative journals from both the national legislature in the US as well as the legislatures of the various states within the country.

    Problem is a lot of them are scanned. And as many of you know when it comes to scanned files, they take up a lot of space. Downloading those journals is going to take up a significant amount of hard drive space. How much?



    Let's make this better. There's a combined 9106 years' worth of files I need to grab at some point, plus 51 combined years for each additional calendar year I'm insane enough to continue splurging on this data. Right now I would estimate that I've got roughly... 450 years' worth in that 1.01 TB estimate above?
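A back-of-the-envelope projection from those figures, as a minimal sketch. It assumes storage scales linearly with journal-years, which is only an assumption (older or born-digital journals may be far smaller or larger than the scanned ones), and the function name `project_tb` is made up for illustration.

```python
# Rough linear projection of storage needs; all figures come from the post above.
TB_SO_FAR = 1.01            # estimated size of the current collection, in TB
YEARS_SO_FAR = 450          # roughly how many journal-years that covers
YEARS_TOTAL = 9106          # combined journal-years to grab at some point
YEARS_ADDED_PER_YEAR = 51   # new journal-years produced each calendar year


def project_tb(journal_years, tb_so_far=TB_SO_FAR, years_so_far=YEARS_SO_FAR):
    """Scale the observed TB-per-journal-year rate to a number of journal-years."""
    return journal_years * tb_so_far / years_so_far


if __name__ == "__main__":
    print(f"9106 journal-years: about {project_tb(YEARS_TOTAL):.1f} TB")
    print(f"Growth per calendar year: about {project_tb(YEARS_ADDED_PER_YEAR):.2f} TB")
```

Under that linear assumption, the 9106 journal-years come to roughly 20 TB, with a little over 0.1 TB added per calendar year.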

    *inserts big problem gif that he cannot find here*

    Two options I'm looking into and hopefully there's a brain out there which knows about these options way better than I do:

    Option 1: Purchase a Dropbox business account that has no space cap.
    Option 2: Purchase a NAS home server to dump these files into.

    Both come with pluses and minuses that I can find, but what I'm hoping to get are thoughts on the best way to handle the issue. I've got a 5 TB Seagate external that's handling everything right now, but if that number is accurate as far as the pacing is concerned, I'm gonna be running it out of the building before too long.

    Thanks in advance for anyone who wants to weigh in. This is likely going to be the foundation of my career in the future so I'm open to opinions on the two options or any other options you want to throw into the pool.
  2. I think the Dropbox account will be the way to go, personally. It may be a little more costly, but I think it's a lot more secure, cost effective, and easier to use in general. It's also much more portable, since you can access the files from anywhere.
    ElfinPineapple likes this.
  3. May I also suggest compressing the files? That will save a lot of space.
    ElfinPineapple likes this.
  4. When you say compressed, are you referring to dumping them into a .zip/.rar file or something entirely different?
  5. A rar file, but there are other programs you can use as well. I've compressed 1 TB of files down to 500 GB to save storage.
    ElfinPineapple likes this.
  6. Would that work for individual PDF files compressed into a folder by year? It totals about 1 TB, but the 40,000+ files range anywhere from less than 25 KB to over 400 MB in size. Median seems to be in the 90-100 MB range from what I can see.
  7. Yep, I tend to compress old seasons converted to a DVD format.
    ElfinPineapple likes this.
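How much zip/rar helps depends on what is inside the PDFs: scanned pages are usually stored as images that are already compressed, so a second pass may gain little. One way to find out before committing is to measure it on a sample. The sketch below uses DEFLATE (the method behind .zip) from Python's standard library; the function names and the 20-file sample size are illustrative assumptions, and a rar or 7z archive may do somewhat better or worse.

```python
import zlib
from pathlib import Path


def compression_ratio(path, chunk_size=1 << 20):
    """Return compressed_size / original_size for one file, streaming it in chunks."""
    compressor = zlib.compressobj(level=9)
    original = compressed = 0
    with open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            original += len(chunk)
            compressed += len(compressor.compress(chunk))
    compressed += len(compressor.flush())
    # An empty file cannot shrink, so report it as "no savings".
    return compressed / original if original else 1.0


def sample_ratios(folder, limit=20):
    """Measure the first `limit` PDFs (sorted by path) found under `folder`."""
    pdfs = sorted(Path(folder).rglob("*.pdf"))[:limit]
    return {pdf: compression_ratio(pdf) for pdf in pdfs}
```

A ratio near 1.0 means the file barely shrinks and archiving mostly just bundles files together; a ratio near 0.5 would match the 1 TB to 500 GB experience described above.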
  8. Another option would be to get a good OCR setup which can grab the text from the images, then store that in your regular digital format. How doable this is depends heavily on the quality of the scans, obviously.
    ElfinPineapple likes this.
  9. What would be needed to go this route?
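One possible route, sketched under assumptions: the open-source OCRmyPDF command-line tool (which relies on the Tesseract OCR engine) can add a searchable text layer to a scanned PDF and write the recognized text to a separate file. The sketch assumes `ocrmypdf` is installed and on the PATH; the helper names and output layout are hypothetical, and recognition quality on old scans is not guaranteed.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocr_command(src_pdf, out_pdf, sidecar_txt):
    """Build an OCRmyPDF command line for one scanned PDF."""
    return [
        "ocrmypdf",
        "--skip-text",                   # leave pages that already contain text untouched
        "--sidecar", str(sidecar_txt),   # also write the recognized text to a .txt file
        str(src_pdf),
        str(out_pdf),
    ]


def ocr_pdf(src_pdf, out_dir):
    """Run OCR on one PDF; return the paths of the searchable PDF and the text file."""
    src_pdf, out_dir = Path(src_pdf), Path(out_dir)
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf was not found on PATH")
    out_dir.mkdir(parents=True, exist_ok=True)
    out_pdf = out_dir / src_pdf.name
    sidecar = out_dir / (src_pdf.stem + ".txt")
    subprocess.run(build_ocr_command(src_pdf, out_pdf, sidecar), check=True)
    return out_pdf, sidecar
```

Trying this on a handful of the worst-looking scans first would show whether the extracted text is good enough to rely on before converting everything.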
  10. Before choosing a path, give some thought to how much risk you are comfortable with. Can the documents be recovered if lost? (Can you store an index and re-download from source?)

    If these are image documents, can you save them as black-and-white?

    Converting to text is an option, since text files are smaller than images of text. Expect some small percentage of data loss though, unless both the conversion utility and the scans are high quality. Text compresses really well when the language and terminology are common.
    ElfinPineapple and FadedMartian like this.
  11. With the exception of a small subset of data - about 5 years of national legislature data only from the late 1700s that I obtained from my university while I was a student - everything was gained from publicly accessible sources to this point.

    Fortunately most everything obtained has been from official government websites, so the likelihood of access disappearing is low. Unless I won the lottery in the worst way possible, the bigger concern here would be the time required to redownload in the event of a loss, as well as the bandwidth required to reobtain everything at once.

    I had a partial loss early in the project due to an unexpected drive failure and it took a couple hours to recover the data. Most was in the cloud but the additions I'd made that day were not, so I had to track down the source sites and redownload those files on top of redownloading the Dropbox backup. It sounds like an index would expedite the recovery should it happen again, so I'm really interested in that particular idea. Do you have any resources that could help me develop one with the files I have now? While all the data I have now is certainly too much to store in my current Dropbox account due to size limits, it would be useful to back up an index there if the file size is reasonable.
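A minimal sketch of what such an index could look like, using only Python's standard library: one CSV row per PDF with its relative path, size, and SHA-256 checksum, plus a `source_url` column left blank. The column names and the `build_index` helper are assumptions for illustration; the source URLs would have to be filled in by hand or from download records, since nothing in the files themselves is known to record them.

```python
import csv
import hashlib
from pathlib import Path

FIELDS = ["relative_path", "size_bytes", "sha256", "source_url"]


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large PDFs never have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_index(root, index_csv):
    """Write one CSV row per PDF under `root`; return the number of rows written."""
    root = Path(root)
    rows = 0
    with open(index_csv, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=FIELDS)
        writer.writeheader()
        for path in sorted(root.rglob("*.pdf")):
            writer.writerow({
                "relative_path": path.relative_to(root).as_posix(),
                "size_bytes": path.stat().st_size,
                "sha256": sha256_of(path),
                "source_url": "",  # placeholder: fill in where the file was downloaded from
            })
            rows += 1
    return rows
```

For 40,000+ files the resulting CSV should be on the order of a few megabytes, small enough to keep in a regular Dropbox account, and the checksums make it possible to verify a redownloaded file against the original.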

    Regarding your black-and-white conversion and text conversion comments: most of the files so far are already in black and white, but I will investigate and convert if it's warranted. Since the majority of the files are text-based documents saved in PDF format, it's likely the conversion won't save too much space, but with file sizes this large I'll take what I can get. Text is also an option, though if I can make the files searchable through Adobe Acrobat I'm pretty sure I can scrape up some Python code to sift through it.
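For the sifting part, a minimal sketch, assuming the text of each journal has already been extracted into `.txt` files alongside the PDFs (it does not read PDFs directly). The function name and the folder layout are hypothetical.

```python
import re
from pathlib import Path


def search_text_files(root, pattern, ignore_case=True):
    """Yield (relative_path, line_number, line) for every matching line in *.txt under root."""
    regex = re.compile(pattern, re.IGNORECASE if ignore_case else 0)
    root = Path(root)
    for path in sorted(root.rglob("*.txt")):
        # errors="replace" keeps the search going past stray bytes left by OCR or conversion.
        with open(path, encoding="utf-8", errors="replace") as fh:
            for number, line in enumerate(fh, start=1):
                if regex.search(line):
                    yield path.relative_to(root).as_posix(), number, line.rstrip("\n")
```

Usage would look something like `for hit in search_text_files("journals", r"appropriation"): print(hit)`, with the pattern being any regular expression.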

    Thanks for bringing up the data security end, JP, and thanks all for the thoughts you've had so far. Many things I had not considered have been proposed here, so I greatly appreciate it, along with any future comments/suggestions that come forward.