Friday, April 25, 2014

The SageMathCloud Roadmap

Everything below is subject to change.

Implementation Goals

  • (by April 27) Major upgrades -- update everything to Ubuntu 14.04 and Sage 6.2, and upgrade all packages used by SMC, including HAProxy, Nginx, stunnel, etc.

  • (by May 4) Streamline doc sync: the top priority right now is to clean up synchronization and eliminate the bugs that show up when the network is bad, when there are many users, etc.

  • (by May 30) Snapshots:
    • more efficient way to browse snapshot history (timeline view)
    • browse snapshots of a single file (or directory) only
  • (by May 30) User-owned backups
    • way to download the complete history of a project, i.e., the underlying bup/git repository with snapshot history.
    • way to update an offline backup, fetching only the changes since the last download.
    • easy way to download all current files as a zip or tarball (without the snapshots).
  • (by June 30) Public content
    • Ability to make a read-only view of the project visible publicly on the internet. Only works after the account is at least n days old.
    • By default, users will have a "report spammer" button on each page. Proven 'good users' will have the button removed. Any user with a valid spam report against them will be permanently banned.
    • To start, only users with validated .edu accounts will be allowed to publish; gmail.com accounts may be allowed at some point.
  • (by June 30) Fix all issues (none of which are major) listed on the GitHub page: https://github.com/sagemath/cloud/issues

  • (by July 31) Group/social features:
    • Support for mounting directories between projects
    • Group management: combine a group of users and projects into a bigger entity:
      • a University course -- see feature list below
      • a research group: a collection of projects with a theme that connects them, where everybody has access to everybody else's projects
    • A feed that shows activity on all projects that users care about, with some ranking, plus better notifications about chat messages and activity.

Commercial Products

We plan four distinct products built on the SMC project: increased quotas, enhanced university course support, a license (with support) to run a private SMC cloud, and a supported open-source, BSD-licensed single-user version of SMC (with an offline mode).

  • PRODUCT: Increase the quota for a specific project (launch by Aug 2014)
    • cpu cores
    • RAM
    • timeout
    • disk space
    • number of share mounts
  • Remarks:
    • There will be an option in the UI, visible to some project collaborators (maybe only owners initially), to change each of the above parameters.
    • Within moments of making a change it goes live and billing starts.
    • When the change is turned off, billing stops. When a project is not running it is not billed. (Obviously, we need to add a stop button for projects.)
    • There is a maximum amount that the user can pre-spend (e.g., $500 initially).
    • At the end of the month, the user is given a link to a University of Washington website and asked to pay a certain amount, registering there under the same email address they use with SMC.
    • When they pay, SMC receives an email and credits their account for the amount they pay.
    • There will also soon be a limit on the number of projects that can be associated with an account (e.g., 10); pay a small monthly fee to raise it.
  • PRODUCT: University course support (launch by Aug 2014 in time for Fall 2014 semester)

    • Free for the instructor and TAs
    • Each student pays $20 in exchange for:
      • one standard project (they can upgrade quotas as above), which the TA and instructor are automatically collaborators on
      • student is added as collaborator to a big shared project
      • homework assignments are distributed to and collected from the student's private project
    • Instructor's project has all student projects as mounted shares
    • Instructor has a student data spreadsheet with student grades, project ids (links), etc.
    • Powerful modern tool for designing homework problems that can be automatically graded, with problems shared in a common pool, with ratings, and data about their usage.
    • A peer grading system for more advanced courses.
    • Tools to make manual grading more fun.
  • PRODUCT: License to run a private SMC cloud in a research lab, company, supercomputer, etc. (launch a BETA version by July 2014, with caveats about bugs).

    • base fee (based on organization size)
    • technical support fee
    • site visit support: install, run workshop/class
  • PRODUCT: Free BSD-licensed single-user account version of SMC (launch by December 2014)

    • a different way to do LaTeX editing, manage a group of IPython notebooks, use Sage worksheets, etc.
    • be included with Sage.
    • be included in many Linux distros
    • includes the doc synchronization code, local_hub, the CoffeeScript client, and the terminal.
    • mostly a Node.js application (with a little Python for Sage/IPython).
    • ability to sync with a cloud-hosted SMC project.
    • sell a pre-configured setup (or just support) for users to install standalone SMC on a cloud host such as EC2 or Digital Ocean

Tuesday, April 15, 2014

SageMathCloud's new storage architecture

Keywords: ZFS, bup, rsync, Sage

SageMathCloud (SMC) is a browser-based hosted cloud computing environment for easily collaborating on Python programs, IPython notebooks, Sage worksheets and LaTeX documents. I spent the last four months wishing very much that fewer people would use SMC. Today that has changed, and this post explains some of the reasons why.

Consistency Versus Availability

Consistency and availability are competing requirements. It is trivial to keep the files in a SageMathCloud project consistent if we store them in exactly one place; however, when the machine hosting that project goes down for any reason, the project stops working, and the users of the project are very unhappy. By making many copies of the files in a project, it's fairly easy to ensure that the project is always available, even if network switches in multiple data centers completely fail, etc. Unfortunately, if there are too many users and the synchronization itself puts too heavy a load on the overall system, then machines will fail more frequently, and though projects remain available, files do not stay consistent and data is lost to the user (though still "out there" somewhere for me to find).

Horizontal scalability of file storage and availability of files are also competing requirements. If there are a few compute machines in one place, they can all mount user files from one central file server. Unfortunately, this approach leads to horrible performance when the network is slow or has high latency, and it doesn't scale up to potentially millions of users. A benchmark I care about is downloading a Sage binary (630MB) and extracting it (creating over 70,000 files); I want this to take at most 3 minutes total, which is hard using a networked filesystem served over the general Internet between data centers. Instead, in SMC, we store the files for user projects on the compute machines themselves, which provides optimal speed. Moreover, we use a compressed filesystem, so in many cases read and write speeds are nearly twice as fast as they might otherwise be.

New Architecture of SageMathCloud

An SMC project with id project_id consists of two directories of files, replicated across several machines using rsync:
  1. The HOME directory: /projects/project_id
  2. A bup repository: /bup/bups/project_id
Users can also create files they don't care too much about in /scratch, which is a compressed and deduplicated ZFS filesystem. It is not backed up in any way, and is local to that compute machine.
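
As a rough sketch of what replication of these two directories could look like, here is a minimal Python fragment that pushes both directories to a list of replicas with rsync; the host names and rsync flags are illustrative assumptions, not SMC's exact invocation.

    import subprocess

    # Hypothetical replica hosts; the real system picks replicas per project.
    REPLICAS = ["compute2.example.com", "compute3.example.com"]

    def replicate(project_id):
        """Push a project's HOME directory and bup repository to each replica
        with rsync (flags are illustrative, not SMC's exact invocation)."""
        for path in ("/projects/%s/" % project_id, "/bup/bups/%s/" % project_id):
            for host in REPLICAS:
                subprocess.check_call(
                    ["rsync", "-axH", "--delete", path, "%s:%s" % (host, path)])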

The /projects directory is one single big ZFS filesystem, which is both lz4-compressed and deduplicated. ZFS compression is just plain awesome. ZFS deduplication is much more subtle, as deduplication is tricky to do right. Since data can be deleted at any time, one can't just use a Bloom filter to very efficiently tell whether data is already known to the filesystem, and instead ZFS uses a much less memory-efficient data structure. Nonetheless, deduplication works well in our situation, since the compute machines all have sufficient RAM (around 30-60GB), and the total data stored in /projects is well under 1TB. In fact, right now most compute machines have about 100GB stored in /projects.
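
For concreteness, a dataset with these properties could be created along the following lines; the pool and dataset names are made up, and the exact tunables SMC uses may differ.

    import subprocess

    def run(cmd):
        print("$ " + " ".join(cmd))
        subprocess.check_call(cmd)

    # Illustrative setup of a /projects-style dataset (pool/dataset names are hypothetical).
    run(["zfs", "create", "pool/projects"])
    run(["zfs", "set", "compression=lz4", "pool/projects"])
    run(["zfs", "set", "dedup=on", "pool/projects"])
    run(["zfs", "set", "mountpoint=/projects", "pool/projects"])
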
The /bup/bups directory is also one single big ZFS filesystem; however, it is neither compressed nor deduplicated. It contains bup repositories, where bup is an awesome git-based backup tool written in Python that is designed for storing snapshots of potentially large collections of arbitrary files in a compressed and highly deduplicated way. Since the git pack format is already compressed and deduplicated, and bup itself is highly efficient at deduplication, we would gain almost nothing by using compression or deduplication directly on this ZFS filesystem. When bup deduplicates data, it does so using a sliding window through the file, unlike ZFS which simply breaks the file up into blocks, so bup does a much better job at deduplication. Right now, most compute machines have about 50GB stored in /bup/bups.
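
A single bup snapshot of a project's working files boils down to roughly the following; this is a sketch only (the branch name is arbitrary, and the real code adds excludes and error handling).

    import os, subprocess

    def bup_snapshot(project_id):
        """Record one bup snapshot of a project's HOME directory in its repository.
        A sketch only; the real code excludes caches, handles errors, etc."""
        env = dict(os.environ, BUP_DIR="/bup/bups/" + project_id)
        home = "/projects/" + project_id
        subprocess.check_call(["bup", "init"], env=env)               # safe if the repo already exists
        subprocess.check_call(["bup", "index", "-x", home], env=env)  # -x: stay on one filesystem
        subprocess.check_call(["bup", "save", "--strip", "-n", "master", home], env=env)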

When somebody actively uses a project, the "important" working files are snapshotted about once every two minutes. These snapshots are done using bup and stored in /bup/bups/project_id, as mentioned above. After a snapshot is successfully created, the files in the working directory and in the bup repository are copied via rsync to each replica node. The users of the project do not have direct access to /bup/bups/project_id, since it is of vital importance that these snapshots cannot be corrupted or deleted; e.g., if you are sharing a project with a fat-fingered colleague, you want peace of mind that even if they mess up all your files, you can easily get them back. However, all snapshots are mounted at /projects/project_id/.snapshots and browsable by the user; this uses bup's FUSE filesystem support, enhanced with some patches I wrote to support file permissions, sizes, change times, etc. Incidentally, the bup snapshots have no impact on the user's disk quota.
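
The snapshot-browsing part can be sketched with bup's FUSE support; the mount logic below is illustrative, and the patched bup mentioned above is what adds the permission/size/time metadata.

    import os, subprocess

    def mount_snapshots(project_id):
        """Mount the project's snapshot history read-only at .snapshots using
        bup's FUSE filesystem (a sketch; not SMC's exact mount options)."""
        env = dict(os.environ, BUP_DIR="/bup/bups/" + project_id)
        mountpoint = "/projects/%s/.snapshots" % project_id
        if not os.path.exists(mountpoint):
            os.makedirs(mountpoint)
        subprocess.Popen(["bup", "fuse", mountpoint], env=env)  # leave the FUSE process running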

We also back up all of the bup archives (and the database nodes) to a single large bup archive, which we regularly back up offsite on encrypted USB drives. Right now, with nearly 50,000 projects, the total size of this large bup archive is under 250GB (!), and we can use it to efficiently recover any particular version of any file in any project. The size is relatively small due to the excellent deduplication and compression that bup provides.

In addition to the bup snapshots, we also create periodic snapshots of the two ZFS filesystems mentioned above... just in case. Old snapshots are regularly deleted. These are accessible to users if they search around enough with the command line, but are not consistent between different hosts of the project, hence using them is not encouraged. This ensures that even if the whole replication/bup system were to somehow mess up a project, I can still recover everything exactly as it was before the problem happened; so far there haven't been any reports of problems.
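
Such filesystem-level snapshots can be created and pruned with commands along these lines; the timestamp format and retention policy here are illustrative assumptions, not the exact schedule SMC uses.

    import subprocess, time

    def zfs_snapshot(dataset):
        """Create a timestamped snapshot of an entire ZFS dataset (illustrative)."""
        name = "%s@%s" % (dataset, time.strftime("%Y-%m-%d-%H%M%S"))
        subprocess.check_call(["zfs", "snapshot", name])

    def zfs_destroy_snapshot(dataset, snapname):
        """Delete an old snapshot; the real cleanup keeps a rolling window of them."""
        subprocess.check_call(["zfs", "destroy", "%s@%s" % (dataset, snapname)])

    # e.g. zfs_snapshot("pool/projects"); zfs_snapshot("pool/bups")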

Capacity

Right now there are about 6,000 unique weekly users of SageMathCloud, often 300-400 simultaneous users, and nearly 50,000 distinct projects. Our machines are at about 20% disk space capacity, and most of them can easily be expanded by a factor of 10 (from 1TB to 12TB). Similarly, disk space for our Google Compute Engine nodes is $0.04 per GB per month. So space-wise we could scale up by a factor of 100 without too much trouble. The CPU load is at about 10% as I write this, during a busy afternoon with 363 clients connected, very actively modifying 89 projects. The architecture that we have built could scale up to a million users, if only they would come our way...
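
A back-of-the-envelope version of the storage-cost claim, using only the rough figures quoted in this post:

    # Rough storage cost per compute node, using the figures above.
    price_per_gb_month = 0.04   # USD per GB per month for persistent disk
    projects_gb = 100           # typical data in /projects per machine
    bups_gb = 50                # typical data in /bup/bups per machine

    cost = (projects_gb + bups_gb) * price_per_gb_month
    print("storage per node: $%.2f/month; at 100x the data: $%.2f/month" % (cost, cost * 100))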