Comments

Allen

The paper describes a very complex solution, with a lot of opportunities for things to go wrong. It would be much simpler to just use PersistentFS, http://www.PersistentFS.com

bisho

Wouldn't it be much easier and faster to use OCFS2?

That way you could have the SAME persistent storage device attached to two EC2 instances, in write mode. You don't need DRBD, which proved to be quite slow in my testing. DRBD is a software tool that simulates a shared disc, but if you can have a shared disc directly, it doesn't make sense.

As I said, OCFS2 offers write support on all mounted nodes, so you can avoid the need for two EC2 instances as NFS servers, plus all the complexity of tunnels, NFS, and Heartbeat.

Just mount the needed persistent storage device on all the AMIs that need it, using OCFS2 to be able to write and lock files from several machines at the same time without conflicts.

And using OCFS2 directly on the persistent storage device is probably faster than accessing it through NFS.

M. David Peterson

@Allen,

>> The paper describes a very complex solution, with a lot of opportunities for things to go wrong.

The paper itself describes what's taking place and how it works, but the reality is you don't need to understand the complexities to run the automated script.

>> It would be much simpler to just use PersistentFS, http://www.PersistentFS.com

Well sure. If you want to learn the ins and outs of how PersistentFS works, however, you would need to understand a lot more than just setting up a configuration file and starting up the process. This process represents an in-depth understanding of what's going on under the hood. As per above, if you don't want to know all of this, just run the script.

Also, PersistentFS is a commercial product. This solution uses off-the-shelf, open source projects, each of which has an active community to support it. That's not to take away from PersistentFS; it's simply a clarification to ensure people understand the difference.

M. David Peterson

@bisho,

>> Wouldn't it be much easier and faster to use OCFS2?

How so?

>> That way you could have the SAME persistent storage device

But you wouldn't have the benefit of failover if the node went down. Persistent storage is only part of the benefit this solution provides.

>> attached to two EC2 instances, in write mode.

You can have the same persistent AND redundant device attached to as many EC2 nodes as you want with this solution, so I'm not sure I understand your point.

>> You don't need DRBD, which proved to be quite slow in my testing. DRBD is a software tool that simulates a shared disc, but if you can have a shared disc directly, it doesn't make sense.

DRBD provides redundancy. If this were just about persistence, then something as simple as using the S3Sync utility and NFS could provide a usable solution. Of course, you're always going to run the risk of potential data loss, and if you have to rebuild a node from scratch, it's going to take a while to rebuild the local disk of a new instance if the data is of significant size.

>> As I said, OCFS2 offers write support on all mounted nodes,

So does NFS.

>> so you can avoid the need for two EC2 instances as NFS servers, plus all the complexity of tunnels, NFS, and Heartbeat.

Not if you want redundancy and fail-over built into your solution.

>> Just mount the needed persistent storage device on all the AMIs that need it, using OCFS2 to be able to write and lock files from several machines at the same time without conflicts.

We're talking about two different things here. You're referring to a distributed file system solution. This is about data persistence, redundancy, and automatic failover.

>> And using OCFS2 directly on the persistent storage device is probably faster than accessing it through NFS.

Not sure. I would have to test. But again, assuming I understand OCFS2 correctly, this is simply a cluster file system that provides benefits similar to NFS. So what you are suggesting seems to boil down to "OCFS2 is better than NFS", something I would have to test before drawing any conclusions. The problem, of course, is that if the main OCFS2 node goes down then the entire system goes down with it. Unless, of course, OCFS2 provides built-in data redundancy across multiple nodes? If so, then this would be interesting to play around with, but nothing I have found thus far suggests this is what OCFS2 provides.

Can you clarify?

M. David Peterson

@bisho,

Another interesting project to take a look at is MogileFS, which in many ways can be compared to memcached (both MogileFS and memcached were written by Brad Fitzpatrick, so it's not surprising they have similar features) but for disks instead of memory.

http://danga.com/mogilefs/

The problem with MogileFS is that it's not POSIX-compliant, so you have to write custom code to read/write from the distributed file system, and as a result you can't run things like a standard DB against it. Nonetheless, it's worth taking a look at. And if OCFS2 provides any sort of "no single point of failure" capability (which I don't believe it does, but I could very easily be wrong), then it would *definitely* be worth taking a look at.

bisho

But it seems that PersistentFS doesn't offer locking, shared access protection, and other important things needed for enterprise use. The NFS option or the OCFS2 option that I mentioned in a previous comment are much better choices for using a shared disc as storage for servers.

M. David Peterson

@bisho,

>> But it seems that PersistentFS doesn't offer locking, shared access protection, and other important things needed for enterprise use.

Oh, seriously? I wasn't aware of this. For some reason I had thought it did.

>> The NFS option or the OCFS2 option that I mentioned in a previous comment are much better choices for using a shared disc as storage for servers.

Well, given this new info regarding PersistentFS, I can't help but agree. I need to dig deeper into OCFS2, as it seems it could certainly present some interesting advantages over NFS. I'll update both the paper and this thread once I've had a chance to play around with things.

Thanks for the tip!

M. David Peterson

@bisho,

Just noticed this at the bottom of the page at http://drbd.org/

---
DRBD and cluster file systems

You can run DRBD with one node in the primary role and the other node in the secondary role. This is recommended for classical fail-over clusters, and is what you should do as long as you use a conventional journaling file system (ext3, XFS, JFS, etc.).

Since DRBD-8.0.0 you can run both nodes in the primary role, enabling you to mount a cluster file system (a physical parallel file system) on both nodes concurrently. Examples of such file systems are OCFS2 and GFS.
---

I'm still trying to make better sense of this, but it seems what they are suggesting is that you can use DRBD either in a classic fail-over configuration or underneath a cluster file system where both nodes are actively reading and writing at the same time.
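
If I'm reading that right, dual-primary mode comes down to a single net-level option in drbd.conf, roughly like this (a sketch only -- the resource name, host names, and devices are placeholders, not what the paper actually uses):

  resource r0 {
    protocol C;            # fully synchronous replication
    net {
      allow-two-primaries; # DRBD 8.x: allow both nodes to be primary
    }
    on node1 {
      device    /dev/drbd0;
      disk      /dev/sdb;  # local backing disk (placeholder)
      address   10.0.0.1:7788;
      meta-disk internal;
    }
    on node2 {
      device    /dev/drbd0;
      disk      /dev/sdb;
      address   10.0.0.2:7788;
      meta-disk internal;
    }
  }

You would then promote the resource with "drbdadm primary r0" on both nodes and put OCFS2 or GFS on /dev/drbd0 rather than a conventional journaling file system.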

I'll play around with things and follow-up again once I've made better sense of this.

bisho

@M. David Peterson:

OCFS2 is not NFS. For NFS you need a server that shares its disc with others over the network. With a cluster FS the disc medium itself is shared, and all the nodes just need to talk to each other to lock, sync, and make reservations so they don't collide with each other. You avoid the need for a dedicated machine serving just storage. The final servers can access the storage directly and talk among themselves to do it properly.

DRBD copies blocks from one machine's disc to another machine over the network. If you choose the safest method, it blocks the write on the originating machine until the write is confirmed on the second machine, and is thus slow. This emulates a physical shared disc.

But the S3 disc, it seems, is already shared, so why not just mount it on all the machines that need access to it? You don't need to copy anything.

Of course, in order to have the shared medium mounted R/W on all machines (RO is not a problem), you need a cluster FS that handles locks and reservations between the nodes that share the FS. One cluster FS that works well is OCFS2 (used by Oracle, now open source), but there are others.

Just configure the nodes, mount the same S3 disc on all machines with OCFS2, and voilà, you will have a top-speed shared medium.
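
In rough terms the setup would be something like this (just a sketch -- the node names, IPs, device path, and mount point are all invented, since I can't test against the real service):

  /etc/ocfs2/cluster.conf, identical on every node:

      node:
          ip_port = 7777
          ip_address = 10.0.0.1
          number = 0
          name = web1
          cluster = webcluster

      node:
          ip_port = 7777
          ip_address = 10.0.0.2
          number = 1
          name = web2
          cluster = webcluster

      cluster:
          node_count = 2
          name = webcluster

  Format once, from any single node, then bring the cluster stack online and mount on every node:

      mkfs.ocfs2 -N 4 -L shared /dev/sdf
      /etc/init.d/o2cb online webcluster
      mount -t ocfs2 /dev/sdf /data

Every node then sees the same files under /data, with locking handled by OCFS2's distributed lock manager.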

I'm sure this will be faster than your option: for every write, you have NFS going over the network, plus a copy sent to the other NFS server by DRBD (over the network too), plus two writes to two S3 discs, which I suppose are on a different network as well. This is surely slower than OCFS2 accessing the shared S3 disc just once, plus some minor traffic for locks and other synchronization between the nodes (telling them that something has changed, locks, reservation of blocks).

And about failover: of course you have failover with OCFS2. The medium is shared, so all the data written is still there. If there are two webservers with a shared OCFS2 disc and one dies, the other will be fine, with the last bits of data written by the failed server still there.

And you save costs and infrastructure: with NFS you need 2 servers just for NFS and 2 shared S3 discs that will contain the data twice. OCFS2 just needs 1 shared disc and no "dedicated OCFS2 servers"; it can be mounted directly on the servers that need the disc.

I can't test this (my account doesn't have the S3 disc option, and I doubt my company will be willing to pay for my testing), but I'm quite sure that a cluster FS is a better option.

bisho

@M. David Peterson:

About MogileFS: on the Amazon platform it's not needed! It makes no sense!

MogileFS just stores files, assigning each a key. You can use S3 directly for that. And S3 ensures that data won't be lost; internally it's probably similar to MogileFS.

M. David Peterson

@bisho,

Just noticing your follow-up comments now. Sorry for the delayed follow-up!

>> OCFS2 is not NFS. For NFS you need a server that shares its disc with others over the network. With a cluster FS the disc medium itself is shared, and all the nodes just need to talk to each other to lock, sync, and make reservations so they don't collide with each other. You avoid the need for a dedicated machine serving just storage. The final servers can access the storage directly and talk among themselves to do it properly.

Right, I do understand that part, but what I'm not sure about is whether or not OCFS2 handles redundancy, or if it's more or less like LVM but for clusters instead of a single machine. In other words, does it make more than one copy of the same file on different nodes such that if one of the nodes in the cluster goes down, you can still gain access to any of the files that it had stored on its block devices?

>> DRBD copies blocks from one machine's disc to another machine over the network. If you choose the safest method, it blocks the write on the originating machine until the write is confirmed on the second machine, and is thus slow. This emulates a physical shared disc.

Sure, but that latency isn't seen by the connecting clients. In other words, the sync process between the DRBD nodes doesn't need to complete before the lock is released and the connecting node is notified as such. So while there's always a risk that if the primary node goes down there could be some data loss, the overall R/W performance of the system isn't adversely affected by DRBD.

>> But the S3 disc, it seems, is already shared, so why not just mount it on all the machines that need access to it? You don't need to copy anything.

By S3 disc, what are you referring to exactly? Do you mean AWS's persistent storage solution? If yes, that's not publicly available. While I personally have access to this service, it's still in private alpha status, and there's no real understanding as to when it will become more widely available. So it's not something that can be used by the masses at the moment.

>> Of course, in order to have the shared medium mounted R/W on all machines (RO is not a problem), you need a cluster FS that handles locks and reservations between the nodes that share the FS. One cluster FS that works well is OCFS2 (used by Oracle, now open source), but there are others.

Well sure, but again, this isn't about increasing your overall disk capacity by sharing your disks across all available nodes. It's about ensuring you always have R/W access to all of your data, regardless of the failure of any individual component. If OCFS2 offers redundancy, however, then there's certainly more to this. But I haven't been able to clarify one way or the other whether this is the case.

>> Just configure the nodes, mount the same S3 disc on all machines with OCFS2, and voilà, you will have a top-speed shared medium.

Oh, absolutely! And when AWS persistent storage becomes more widely available, I can assure you this is exactly where the focus will be. In the meantime, this solution gets us at least part of the way.

>> I'm sure this will be faster than your option: for every write, you have NFS going over the network, plus a copy sent to the other NFS server by DRBD (over the network too), plus two writes to two S3 discs, which I suppose are on a different network as well.

Wait. How did S3 disks come into this? I don't use anything related to S3 in my white paper. These are all ephemeral drives that are part of each EC2 instance.

>> This is surely slower than OCFS2 accessing the shared S3 disc just once, plus some minor traffic for locks and other synchronization between the nodes (telling them that something has changed, locks, reservation of blocks).

It would be a little bit faster because the writes are going to be spread out across a broader array of disks rather than a single mount point that multiple machines will be accessing. In essence, what we're looking at is RAID 0 across multiple machines instead of a single machine with multiple block devices. But similar to RAID 0, assuming I am understanding OCFS2 correctly, if one disk goes the entire system is adversely affected. Of course, unlike RAID 0, if one node goes down, the rest can still function as far as writing data to the overall cluster is concerned. But without any redundancy, if I need access to the files contained on the failed node's block device, my only recourse is to look to backups of the lost device, which means I'm now waiting for the reconstruction of the backed-up data to complete before I gain access to that data. While it might be a little bit slower, this architecture keeps things up and running from an always-available R/W perspective, which is its primary focus.

>> And about failover: of course you have failover with OCFS2. The medium is shared, so all the data written is still there. If there are two webservers with a shared OCFS2 disc and one dies, the other will be fine, with the last bits of data written by the failed server still there.

Well sure. But what if I want access to the files that are contained on the failed node? This isn't just about always having somewhere to write data to. It's about having read/write access to all data on the system at all times.

>> And you save costs and infrastructure: with NFS you need 2 servers just for NFS and 2 shared S3 discs that will contain the data twice.

Again, I'm not sure what you mean by S3 disks.

>> OCFS2 just needs 1 shared disc and no "dedicated OCFS2 servers"; it can be mounted directly on the servers that need the disc.

Well if you only have one node in the cluster, it's not a cluster. Why would you use a cluster file system on a single node?

>> I can't test this (my account doesn't have the S3 disc option, and I doubt my company will be willing to pay for my testing), but I'm quite sure that a cluster FS is a better option.

As far as I understand things, it's a completely different option. If it doesn't provide data redundancy such that if one node fails the other can pick up where it left off, then we're talking about two completely different things here.

M. David Peterson

@bisho,

>> About MogileFS: on the Amazon platform it's not needed! It makes no sense!

Hmmm.. Yes and no. Depends on how you are looking at it.

>> MogileFS just stores files, assigning each a key. You can use S3 directly for that. And S3 ensures that data won't be lost; internally it's probably similar to MogileFS.

Yes, this is true. But mounting S3 as a block device is not as easy as it sounds. PersistentFS does a good job of allowing you to mount S3 as a block device, but as you pointed out earlier, there is no support for locking, and you're always dealing with the added cost of both the network and the signing of each read/write request. So it has its drawbacks as well.

That said: Stay tuned... I'm working on an extension to my current solution that adds S3 to the mix and -- coupled with FUSE -- provides the same advantages provided by MogileFS.

bisho

@M. David Peterson:

> Right, I do understand that part, but what I'm not sure about is whether or not OCFS2 handles redundancy, or if it's more or less like LVM but for clusters instead of a single machine. In other words, does it make more than one copy of the same file on different nodes such that if one of the nodes in the cluster goes down, you can still gain access to any of the files that it had stored on its block devices?

No. The medium is shared. All cluster servers access the same disc, so all the files are accessible to all of them. The redundancy is just at the medium level (for example, a RAID disc). With Amazon, the disc you are using is supposed to have redundancy; you won't lose its data. So you just mount the *same* disc on all servers, and there are no duplicate copies of files.

> Sure, but that latency isn't seen by the connecting clients. In other words, the sync process between the DRBD nodes doesn't need to complete before the lock is released and the connecting node is notified as such. So while there's always a risk that if the primary node goes down there could be some data loss, the overall R/W performance of the system isn't adversely affected by DRBD.

I have observed some problems with DRBD on some of the servers where I have been using it, and NFS is usually not recommended as DB storage.

>> But the S3 disc, it seems, is already shared, so why not just mount it on all the machines that need access to it? You don't need to copy anything.

> By S3 disc, what are you referring to exactly? Do you mean AWS's persistent storage solution? If yes, that's not publicly available. While I personally have access to this service, it's still in private alpha status, and there's no real understanding as to when it will become more widely available. So it's not something that can be used by the masses at the moment.

I mean that you can mount the same AWS persistent storage disc on several servers, so it can be used as a shared disc, the same as having an iSCSI NAS.

> Well sure, but again, this isn't about increasing your overall disk capacity by sharing your disks across all available nodes. It's about ensuring you always have R/W access to all of your data, regardless of the failure of any individual component. If OCFS2 offers redundancy, however, then there's certainly more to this. But I haven't been able to clarify one way or the other whether this is the case.

Yes, with OCFS2 you always have R/W access to the disc. The disc is shared, so all nodes can read from or write to it. They just need coordination, and that's what OCFS2 provides.

> Wait. How did S3 disks come into this? I don't use anything related to S3 in my white paper. These are all ephemeral drives that are part of each EC2 instance.

Sorry, by S3 disc I mean the AWS persistent storage disc.

> It would be a little bit faster because the writes are going to be spread out across a broader array of disks rather than a single mount point that multiple machines will be accessing. In essence, what we're looking at is RAID 0 across multiple machines instead of a single machine with multiple block devices.

I don't think so. The limiting factor is usually the network connection, not the discs. And there is no physical disc on Amazon; one disc is probably already shared and backed by some kind of RAID.

Plus, if using different Amazon drives really gives an improvement (which I doubt), you can mount 10 AWS discs on all nodes, make a RAID 0 of those, and put OCFS2 on top of that.

> But similar to RAID 0, assuming I am understanding OCFS2 correctly, if one disk goes the entire system is adversely affected.

No, it's not.

> Of course, unlike RAID 0, if one node goes down, the rest can still function as far as writing data to the overall cluster is concerned. But without any redundancy

No, the disc that Amazon provides is not a physical disc. It's probably space allocated on a redundant array of discs, so you can trust it won't lose data.

Even if data persistence is not assured on Amazon, you can still mount several drives in shared mode, build RAID on top of them, and put OCFS2 on top of that.

> if I need access to the files contained on the failed node's block device, my only recourse is to look to backups of the lost device, which means I'm now waiting for the reconstruction of the backed-up data to complete before I gain access to that data. While it might be a little bit slower, this architecture keeps things up and running from an always-available R/W perspective, which is its primary focus.

As I said, I think you are wrong. There is no loss of data if any node fails.

> Well sure. But what if I want access to the files that are contained on the failed node? This isn't just about always having somewhere to write data to. It's about having read/write access to all data on the system at all times.

There are NO files on any node. The files are stored on a disc that is external and shared between all the nodes. So all the nodes access the SAME disc, the SAME data. If one node dies, the other nodes still have access to all the data.

> Well if you only have one node in the cluster, it's not a cluster. Why would you use a cluster file system on a single node?

If you just have one webserver, then you don't need a cluster FS, but you don't need an NFS cluster either. But if you have two webservers serving the same content, the best way is just OCFS2 and 1 disc shared between the two webservers. If one dies, the other can still serve all files, unaffected.

> As far as I understand things, it's a completely different option. If it doesn't provide data redundancy such that if one node fails the other can pick up where it left off, then we're talking about two completely different things here.

No, really, please look carefully at how a cluster FS works. OCFS2 allows one disc's data to be shared by all live nodes, without any loss of data if one node fails. The data is not on the nodes but on the shared disc.

M. David Peterson

@bisho,

>> No, really, please look carefully at how a cluster FS works. OCFS2 allows one disc's data to be shared by all live nodes, without any loss of data if one node fails. The data is not on the nodes but on the shared disc.

Obviously I need to research OCFS2 in greater depth. What I think I'm missing is what you mean by "shared disc". Where does that shared disc exist? And how would that map to EC2 based on *today's* offering (i.e. minus the forthcoming persistent storage solution)? I do understand how the notion of an OCFS2 shared disk and the future EC2 persistent storage offering could be mapped to one another. But not how today's offering -- where there is no such thing as EC2 persistent storage -- could be.

Are you suggesting that when EC2 persistent storage becomes available, implementing OCFS2 on top of it will be the better strategy?

bisho

A shared disc is typically an iSCSI NAS in a separate cabinet.

You set all the nodes to access and share that disc. The disc itself is usually redundant: RAID-something, two motherboards in the same cabinet, two (or more) network connections... So the data can be assumed to be safe. The nodes store no information. You can turn one off without a problem, and you can add one if needed and share the same disc.

With Amazon, the EC2 persistent disc is very interesting. If the same disc can be attached to several nodes at the same time (I really hope so), you could use it like a shared iSCSI NAS, but at something like 1/1000 of the cost, of course. You can't imagine how much an iSCSI NAS costs...

And to be able to mount the same EC2 persistent disc on several servers at the same time, in write mode, without problems and corruption, of course you need a cluster FS like OCFS2 (but there are others).

And yes, I think it would be a great strategy.

You could set up a cloud of webservers with a shared OCFS2 disc, boot more nodes when needed, and shut some down when there are few requests. And you don't need the slower option of having two nodes just for NFS.

M. David Peterson

@bisho,

>> You set all the nodes to access and share that disc. The disc itself is usually redundant: RAID-something, two motherboards in the same cabinet, two (or more) network connections... So the data can be assumed to be safe. The nodes store no information. You can turn one off without a problem, and you can add one if needed and share the same disc.

Okay, so I guess I assumed you were aware of the fact that this particular configuration is not possible with EC2, at least as far as the disk array residing within the EC2 data center is concerned. You could do it with an array hosted outside EC2, but the performance hit would be so big as to make the gained benefits all but moot.

The only way to mimic this same type of architecture within the confines of the EC2 data center is to think of the two DRBD nodes in the paper's architecture as the iSCSI NAS in a separate cabinet. Of course this isn't what is actually taking place, which is why NFS makes sense in this configuration (UnionFS, a stackable union file system, provides some interesting opportunities here as well). Of course, OCFS2 would probably work just as well, and there would likely be some performance gains too. But I haven't had a chance to test it, something I hope to do before too long.
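
For anyone following along, the classic Heartbeat v1 way of expressing that kind of active/passive DRBD-plus-NFS pair is a single haresources line, roughly like this (a sketch only -- the node name, service IP, device, and mount point are placeholders rather than the paper's actual configuration):

  # /etc/ha.d/haresources, identical on both nodes; node1 is the preferred primary
  node1 drbddisk::r0 Filesystem::/dev/drbd0::/export::ext3 IPaddr::10.0.0.10/24/eth0 nfs-kernel-server

On failover, Heartbeat promotes the DRBD resource on the surviving node, mounts it, takes over the service IP, and starts the NFS server, in that order.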

When that happens, I'll publish my findings and update this thread accordingly.

PersistentFS

PersistentFS supports file locking and shared access protection. If you have any questions or run into any difficulties using PersistentFS, please do not hesitate to contact us at http://www.PersistentFS.com/contact. Thank you.

M. David Peterson

@PersistentFS,

Thanks for the clarification!

bisho

@M. David Peterson:

> Okay, so I guess I assumed you were aware of the fact that this particular configuration is not possible with EC2, at least as far as the disk array residing within the EC2 data center is concerned. You could do it with an array hosted outside EC2, but the performance hit would be so big as to make the gained benefits all but moot.

I don't understand you... Why is it not possible? Amazon's EC2 persistent volumes that can be attached virtually to servers *ARE* network-based discs, and Amazon ensures they won't be lost, the same as it guarantees S3 data is not lost. Also, Amazon says "Volumes are designed for high throughput, low latency access from Amazon EC2", so they will be fast.

So if you can attach the same volume to two or more nodes at the same time, it is exactly the same configuration as NAS storage, and OCFS2 is *definitely* the best way to go. You don't need anything else! And DRBD is not needed!

Even in your current configuration, DRBD is redundant. Attach the same EC2 persistent volume to the two NFS servers. The primary server mounts the volume read-write, and the other doesn't mount it yet; it just has access to the device. If the first dies, just mount the device on the second NFS server and start NFS. All data written by the first NFS server will be there, as it's actually the *SAME* device. Why do you need DRBD? It is a persistent network volume, shared between nodes.
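
The takeover on the second server would then be nothing more than something like this (a sketch with made-up device names and IP -- and in real life you would want to be certain the first server is really down before mounting, since a non-cluster FS must never be mounted read-write on two machines at once):

  # on the standby NFS server, once the primary is confirmed down
  mount /dev/sdf /export               # the same shared persistent volume
  /etc/init.d/nfs-kernel-server start  # start serving the existing exports
  ip addr add 10.0.0.10/24 dev eth0    # take over the service IP the clients use

That is exactly the sequence a tool like Heartbeat automates, but without DRBD in the middle.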

M. David Peterson

@bisho,

I agree: EC2 persistent storage is by far the superior solution. But it's not available today to anyone other than a handful of private alpha participants. As Jeff points out, in the paper I specify:

>> "the primary focus of this paper is to present both a detailed overview as well as a working code base that will enable you to begin designing, building, testing, and deploying your EC2-based applications using a generalized persistent storage foundation, doing so today in both lieu of and in preparation for release of Amazon Web Services offering in this same space."

In other words, this provides a solution that you can use *today*, and when EC2 persistent storage becomes more readily available, you can then adapt your solution to use EC2 persistent storage instead. At that point, you are right: There will be no need for DRBD. But at the moment, DRBD is the only way to gain data persistence.

bisho

@M. David Peterson:

Ahhh... OK! Now I understand what you were saying :) Of course OCFS2 is not an option currently; it needs the EC2 persistent storage, which is still in closed beta.

It would be great if you, who have access to the beta program, could test this kind of arrangement. And I would also love to test it and write an article about it (maybe there is a free place in the beta program? ;) hehehe)

Bye!
