Setting Up a Data Analyst’s Home Lab — Part 1
In my previous article, I discussed the motivation behind starting a homelab. If you’re curious for a bit more context, I highly recommend reading that article before continuing. I’ll still be here when you’re done.
Well, now that you’re caught up or brave enough to plod on, let’s get to the first order of business — setting up storage!
Setting Up Storage (NAS)
If I want to do any kind of data crunching and processing, I need to create a home where the data will live! Welcome to the NAS (short for Network Attached Storage). I managed to snag a TerraMaster F2-220 and two 2TB hard drives for my initial setup. I do not anticipate needing much more storage than this in the short to medium term.
One of the biggest decisions you have to make with a NAS, especially a 2-bay NAS, is whether to set the drives up as a RAID array. The advantage of a redundant RAID configuration (such as RAID-1) is that your data storage can withstand one (or more) hard drive failures and keep trucking along. When 24/7 access to your data is critical, a RAID configuration can make a ton of sense. Similarly, if the data is super important to you and you want to minimize the chance of data loss, RAID can also make a ton of sense.
So, if you’re writing a senior/graduate thesis and don’t want to be sobbing epic tears if your hard drives decide to quit on you… RAID may be worth the peace of mind. Of course, this brings me to a couple of the disadvantages of using a RAID configuration.
The first disadvantage is storage loss. When you’re going for storage on a budget, the loss of 50% of your total storage for a RAID-1 configuration can be hard to swallow. Although I have 4TB of raw hard drive space, I only have 2TB of usable space with a RAID-1 setup, since the data is mirrored across both drives.
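To make the mirroring arithmetic concrete, here is a minimal sketch in Python. The function name `usable_capacity_tb` is made up for illustration, and it only models the two simplest RAID levels (striping and mirroring), ignoring filesystem overhead and anything vendor-specific.

```python
def usable_capacity_tb(drive_sizes_tb, level):
    """Rough usable capacity in TB for a simple RAID-0 or RAID-1 array.

    Hypothetical helper for illustration only: it ignores filesystem
    overhead and assumes the array is limited by its smallest drive.
    """
    if not drive_sizes_tb:
        raise ValueError("need at least one drive")
    smallest = min(drive_sizes_tb)
    if level == 0:
        # Striping: all drives contribute capacity, but there is no redundancy.
        return smallest * len(drive_sizes_tb)
    if level == 1:
        # Mirroring: every drive holds the same data, so only one drive's
        # worth of space is usable.
        return smallest
    raise ValueError(f"RAID level {level} is not modeled in this sketch")


drives = [2, 2]  # two 2TB drives, as in the setup described above
raw = sum(drives)
usable = usable_capacity_tb(drives, level=1)
print(f"Raw: {raw} TB, usable with RAID-1: {usable} TB "
      f"({100 * (raw - usable) // raw}% given up for redundancy)")
```

Swapping `level=1` for `level=0` shows the other side of the trade-off: all of the raw space, but no protection if a drive dies.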
Another disadvantage of RAID is the false sense of security it can lure you into. Just like your favorite blanket feels nice and safe, your NAS with a RAID array can make you feel impervious to data destruction… until you spill a cup of hot chocolate. In this example, if both hard drives were destroyed, you’ve still lost the data.
You’ll still be crying long and hard if you didn’t have an alternative back-up solution. What, you thought that your NAS was your data backup? Unfortunately, you were wrong. A lone NAS should not be thought of as a backup solution. Especially for people like you and me, it’s often worth it (and cheap enough) to back up our data to a cloud provider.
Based on these considerations, I ultimately decided to take a middle path. I elected to lose 50% of my potential storage for the RAID-1 configuration on my TerraMaster, but will also be paying to use a cloud service. My rationale behind which service I chose is outlined in the next section.
The setup process for the TerraMaster (and most off-the-shelf NAS units) is pretty straightforward. Once you connect to the device via your phone or desktop, the installer walks you through everything and it becomes plug and play.
Although building my own would have been (slightly) cheaper and pretty fun, for my use case I decided that the convenience of an off-the-shelf solution was worth it right now, since it saved me the couple of hours it would take to learn how to build and set up my own NAS.
I would eventually like to build my own large-scale NAS once I secure a sweet data analyst job though. Below you’ll find a few resources that were useful for me when I was considering building my own NAS.
Resources to Build a NAS:
My Backup Services and Schedule
Earlier, we established that one precariously placed cup of hot chocolate can wipe out all our data and make us see our dreams of a perfect thesis go up in smoke (perhaps literally). Since we’re not running an enterprise, a backup to an online cloud provider is a great idea. Of course, there is some risk of that provider getting compromised and subsequently your data getting out, but welcome to the wild west of the internet.
Since most of the data I plan on backing up isn’t sensitive (like the courses I’ve collected), I’m fine with using commercial options for it. Of course, it would be handy for me to also store some of my own sensitive data in the same place and at the same time.
At first, I thought I’d have to use two separate services (one for my sensitive data and one for my non-sensitive data). Upon further investigation, I realized I could just go with one backup/cloud provider and encrypt the data prior to uploading with Cryptomator.
Although going with Cryptomator isn’t a perfect solution, it’s pretty good for my purposes. It gives me the peace of mind I need for my more private data and allows me to use one backup service.
There are several options for cloud providers, some really well known and some less well known. I wanted a bit of a break from the Google ecosystem, so I decided that Backblaze B2 was going to be my cloud provider of choice. I’ll leave links for some common alternatives below for you to explore:
Links to Cloud Providers (non-affiliate):
After I had done the hard part, it was time for a snack break!
Revived after my snack, I came to the following conclusions. I’m ultimately going to have my data in 3 places: on a Nextcloud instance on my home server (coming soon) which will back up to the NAS, on my main machine(s), and in the cloud with Backblaze B2. This setup allows me to be in compliance with the 3-2-1 rule (linked for those interested in some standard practices). Once I set up the automatic backups, it should be a fairly set-and-forget process (which I’m thankful for!)
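For anyone who hasn’t met the 3-2-1 rule before, it is commonly stated as: at least 3 copies of your data, on at least 2 different kinds of storage, with at least 1 copy offsite. Below is a rough sketch of that check in Python. The `BackupCopy` class, the `satisfies_3_2_1` function, and the medium labels are all made up for illustration; they are my assumptions about how to model the plan above, not anything a backup tool provides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    """One place where a copy of the data lives (hypothetical model)."""
    name: str
    medium: str    # free-form label for the kind of storage (assumed values)
    offsite: bool  # True if the copy lives outside the home


def satisfies_3_2_1(copies):
    """Check the usual statement of the 3-2-1 rule against a list of copies."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.medium for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite


# The three locations described above; the medium labels are assumptions.
copies = [
    BackupCopy("Nextcloud on home server, backed up to the NAS", "nas_hdd", False),
    BackupCopy("Main machine(s)", "local_disk", False),
    BackupCopy("Backblaze B2", "cloud_object_storage", True),
]

print("3-2-1 satisfied:", satisfies_3_2_1(copies))
```

Dropping the Backblaze B2 entry from the list makes the check fail twice over: only two copies remain and none of them is offsite, which is exactly the hot chocolate scenario.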
Oh the Data I’ll Store
With these boring but important steps out of the way, I now feel like I’ve set the foundation to back up my existing computers and start construction of the homelab. In the coming weeks, I’ll be adding my laptop for an impromptu media server and the beast of a Dell for more intense data crunching and random tinkering.
Of course, this (reasonably) robust infrastructure would be a complete waste of time if I didn’t have data to store or know what I was going to store. So, you may be asking, what data is going to live in my ecosystem? Well, I’m glad you asked! Below is a summary of the types of data I plan on storing:
- Boring personal data (think adulting stuff like taxes, health stuff, etc.)
- Fun media, including:
  - eCourses I’ve downloaded
  - Favorite webcomic series
  - YouTube videos and select films (including my old DVDs that I’m going to digitize)
- MySQL databases (for self-education)
- Datasets for school or personal projects
- STL and GERBER files for upcoming projects
- More suggestions?
I’m looking forward to this next phase. Soon my labor will start to pay off and then I can write about how I’m using the homelab to advance my knowledge and career as a data analyst (yes, optimizing my personal media consumption is also a key part of my strategy.)
This part of my homelab series covered my choice of NAS, the trade-offs I weighed and which ones ultimately made sense for me, and how I plan to keep my data safe.
Although data management is less sexy than loading a model and doing some analysis, proper storage and safeguards on data are vital to the work we do.
I believe that understanding the other side of the equation a little more (e.g. how data gets stored, secured, and ultimately prepared to be pulled) will make me a better analyst. I’m looking forward to my next steps where I download some data and finish setting up the first version of my homelab.
I’d love to hear any thoughts, tips, tricks, or advice you have! Until we meet again, may you find the fun, laughter, and adventure that await you!