Tuesday, May 6, 2014

Update to Differential Synchronization in SageMathCloud

I've just pushed out a major update to how synchronization works in https://cloud.sagemath.com.

This change is pretty significant behind the scenes, but the only difference you should notice is that everything should be better. In particular:

  • evaluation of code in Sage worksheets should feel a little snappier and more robust,
  • various random and hard-to-reproduce issues with synchronized editing should be fixed, e.g., chat messages appearing out of order,
  • everything should generally be a bit faster and more scalable overall.

Here's a short technical description of what changed. The basic architecture of SageMathCloud is that there are many web browsers connected to many hubs, which are in turn connected to your project (and to many other projects too):

  [web browser] <- websocket ----\/
  [web browser] <------------> [hub]<------ tcp -------\/
                                                     [project]
  [web browser] <------------> [hub]<------------------/\

Until today, the differential synchronization implementation involved keeping a copy of the document you're editing in three places:

  1. on each hub pictured above,
  2. in each browser, and
  3. in the project itself.

In particular, there were three slightly different implementations of differential synchronization running all over the place. The underlying core code is the same for all three, but the way it is used in each case is different, due to different constraints. The implementations:

  • browser: running in a web browser, which mainly has to worry about dealing with the CodeMirror editor and a flaky Internet connection.
  • hub: running in a node.js server that's also handling a lot of other stuff, including worrying about auth, permissions, proxying, logging, account creation, etc.
  • project: running in the project, which doesn't have to worry about auth or proxying or much else, but does have to worry about the filesystem.

Because we're using Node.js, all three implementations are written in the same language (CoffeeScript) and run the same underlying core code (which I BSD licensed at https://github.com/sagemath/cloud/blob/master/diffsync.coffee). The project implementation was the easiest to write, since it's simple and straightforward and has minimal constraints. The browser implementation is mainly difficult because the Internet connection comes and goes (as laptops suspend/resume), and because it involves patching and diff'ing a CodeMirror editor instance. CodeMirror is difficult because it views the document as a sequence of lines instead of a single long string, and we want things to work even for documents with hundreds of thousands of lines, so converting back and forth to a string is not an option! Implementing the hub part of synchronization is the hardest, for various reasons, and debugging it is particularly hard. Moreover, computing diffs can be computationally expensive if the document is large, so doing anything involving differential sync on the hub can result in nontrivial blocking CPU usage, and hence slower handling of other users' messages (node.js is single threaded). The hub part was so hard to get right that it had some nasty locking code, which shouldn't be needed, and just looked like a mess.
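To make the idea concrete, here is a minimal sketch of one differential synchronization cycle between two endpoints (say, a browser and a project), written in Python instead of CoffeeScript. It is not the code in diffsync.coffee: the names Endpoint, make_patch and apply_patch are made up for illustration, the document is treated as a list of lines, and the patch is applied strictly by position, which only works if the receiving side has not edited its copy since the last sync. A real implementation needs a fuzzy, best-effort patch for the live document.

```python
from difflib import SequenceMatcher


def make_patch(old, new):
    """Return the edits that turn the list of lines `old` into `new`.

    Each edit is (start, end, replacement): replace old[start:end].
    """
    ops = SequenceMatcher(a=old, b=new, autojunk=False).get_opcodes()
    return [(i1, i2, new[j1:j2]) for tag, i1, i2, j1, j2 in ops if tag != "equal"]


def apply_patch(lines, patch):
    """Apply edits from last to first so earlier indices stay valid."""
    result = list(lines)
    for start, end, replacement in reversed(patch):
        result[start:end] = replacement
    return result


class Endpoint:
    """One side of a sync pair (hypothetical helper class).

    `live` is the document being edited; `shadow` is what this side
    believes the other side last saw.
    """

    def __init__(self, lines):
        self.live = list(lines)
        self.shadow = list(lines)

    def outgoing(self):
        # Diff the shadow against the live document, then advance the shadow.
        patch = make_patch(self.shadow, self.live)
        self.shadow = list(self.live)
        return patch

    def incoming(self, patch):
        # Simplification: assumes live == shadow here, so the same
        # positional patch applies cleanly to both copies.
        self.shadow = apply_patch(self.shadow, patch)
        self.live = apply_patch(self.live, patch)


browser = Endpoint(["x = 1", "y = 2", "print(x + y)"])
project = Endpoint(["x = 1", "y = 2", "print(x + y)"])

browser.live[1] = "y = 40"                  # the user types in the browser
project.incoming(browser.outgoing())        # browser -> project
project.live.append("# saved")              # the project changes the file
browser.incoming(project.outgoing())        # project -> browser
```

Only the edits travel over the wire, never the whole document, which is why working with lines directly (instead of one giant string) matters for very large files.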

A lot of the issues people ran into with sync involved two browsers connected to different hubs, which in turn connected to the same document in a project. The two hubs' crappy synchronization would appear to work right in this part of the picture, "[web browser] <------------> [hub]", but have problems with this part, "[hub]<-------------->[project]", which would lead to pain later on. In many cases, the only fix was to restart the hub (to kill its sync state) or for the user to switch hubs (by clearing their browser cookies).

Change: I completely eliminated the hub from the synchronization picture. Now the only thing the hub does related to sync is forward messages back and forth between the web browser and the local hub. Implementing this was harder than one might think, because the project considered each client to be a single tcp connection, but now many clients can connect to a project via the same tcp connection, etc.
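One way to picture the "many clients over one tcp connection" problem is the sketch below. It is an illustration only, not SMC's actual message format: the envelope fields client_id and payload, and the names hub_wrap and ProjectSessions, are assumptions made up for this example. The hub tags each forwarded message with the client it came from, and the project keeps its per-client state keyed by that tag instead of by connection.

```python
import json


def hub_wrap(client_id, payload):
    """What a forwarding-only hub might do: tag the message and pass it on."""
    return json.dumps({"client_id": client_id, "payload": payload})


class ProjectSessions:
    """Per-client state in the project, keyed by client id (hypothetical)."""

    def __init__(self):
        self.sessions = {}

    def handle(self, raw):
        # Every message arrives on the same connection, so the client id in
        # the envelope is the only thing that tells the senders apart.
        envelope = json.loads(raw)
        session = self.sessions.setdefault(envelope["client_id"], [])
        session.append(envelope["payload"])
        return envelope["client_id"]


project_side = ProjectSessions()
# Two browsers, one hub, one shared connection to the project.
project_side.handle(hub_wrap("browser-1", {"edit": "x = 1"}))
project_side.handle(hub_wrap("browser-2", {"edit": "y = 2"}))
project_side.handle(hub_wrap("browser-1", {"edit": "x = 3"}))
```

The hub in this picture holds no copy of the document and no sync state, so restarting it loses nothing.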

With this fix, if there are any bugs left with synchronization, they should be much easier to debug. The backend scalability and robustness of sync have been my top priorities for quite a while now, so I'm happy to get this stuff cleaned up, and move on to the next part of the SMC project, which is better collaboration and course support.

Thursday, May 1, 2014

What can SageMathCloud (SMC) do?

The core functionality of SageMathCloud:

  • Color command-line Terminal with many color schemes, which several people can interact with at once, with state that survives browser refresh.
  • Editing of documents: with syntax highlighting, auto-indent, etc., for files with the following extensions:
    c, c++, cql, cpp, cc, conf, csharp, c#, coffee, css, diff, dtd, e, ecl, f, f90, f95, h, hs, lhs, html, java, jl, js, lua, m, md, mysql, patch, gp, go, pari, php, py, pyx, pl, r, rst, rb, ru, sage, sagews, scala, sh, spyx, sql, txt, tex, toml, bib, bbl, xml, yaml.
(It's easy for me to add more, as CodeMirror supports them.) There are many color schemes and Emacs and Vim bindings.
  • Sage Worksheets: a single-document interactive way to evaluate Sage code. This is highly extensible, in that you can define % modes by simply making a function that takes a string as input, and use %default_mode to make that mode the default. Also, graphics actually work in %r mode automatically, exactly as in normal R (no mucking with devices or png's).
  • IPython notebooks: via an IPython session that is embedded in an iframe. This is synchronized, so that multiple users can interact with a notebook simultaneously, which was a nontrivial addition on top of IPython.
  • LaTeX documents: This fully supports sagetex, bibtex, etc., and the LaTeX compile command is customizable. This also has forward and inverse search, i.e., double click on the preview to get to the corresponding point in the tex file, and alt-enter in the tex file to get to the corresponding point in the preview. In addition, this editor will work fine with 150+ page documents by design. (Editing documents split across multiple files is not properly supported yet.)
  • Snapshots: the complete state of all files in your project is snapshotted (using bup, which is built on git) every 2 minutes while you're actively editing a file. All of these snapshots are also regularly backed up to encrypted disks offsite, just in case. I plan for these highly efficient deduplicated compressed snapshots to be saved indefinitely. Browse the snapshots by clicking "Snapshots" to the right when viewing your files, or type cd ~/.snapshots/master in the terminal.
  • Replication: every project is currently stored in three physically separate data centers; if a machine or data center goes down, your project pops up on another machine within about one minute. A computer at every data center would have to fail for your project to be inaccessible. I've had zero reports of projects being unavailable since I rolled out this new system 3 weeks ago (note: there was a project that didn't work, but that was because I had set the quota to 0% cpu by accident).
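
As a small illustration of the % modes mentioned under Sage Worksheets above: a mode is just a function that takes the cell contents as a string. The sketch below defines a made-up mode called rev. The names rev and reverse_lines are hypothetical, and I'm assuming here that whatever the function prints shows up as the cell's output.

```python
def reverse_lines(s):
    """Return the text with the order of its lines reversed."""
    return "\n".join(reversed(s.splitlines()))


def rev(s):
    # A % mode is a function of one argument: the body of the cell.
    print(reverse_lines(s))
```

With rev defined in a worksheet, a cell whose first line is %rev would pass the rest of the cell to rev as a string, and %default_mode could then be used to make it the default for later cells.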

Sage

The Sage install contains the following extra packages (beyond what is standard in Sage itself). When you use Sage or IPython, this will all be available.
basemap, biopython, bitarray, brian, cbc, chomp, clawpack, cluster_seed, coxeter3, cryptominisat, cunningham_tables, database_cremona_ellcurve, database_gap, database_jones_numfield, database_kohel, database_odlyzko_zeta, database_pari, database_symbolic_data, dot2tex, fabric, gap_packages, gnuplotpy, greenlet, guppy, h5py, httplib2, kash3, lie, lrs, lxml, mahotas, mercurial, mpld3, munkres, mysql-python, nauty, netcdf4, neuron, normaliz, nose, numexpr, nzmath, oct2py, p_group_cohomology, pandas, paramiko, patsy, phc, plotly, psutil, psycopg2, pybtex, pycryptoplus, pyface, pymongo, pyproj, pyx, pyzmq, qhull, quantlib, redis, requests, rpy2, scikit_learn, scikits-image, scimath, shapely, simpy, snappy, statsmodels, stein-watkins-ecdb, tables, theano, topcom, tornado, traits, xlrd, xlwt, zeromq

R

Also, I install the following extra packages into the R that is in Sage:
KernSmooth, Matrix, Rcpp, cairo, car, circular, cluster, codetools, e1071, fields, ggplot2, glmnet, lattice, mgcv, mvtnorm, plyr, reshape2, rpart, stringr, survival, zoo

It's Linux

SMC can do pretty much anything that can be done with Ubuntu 14.04 Linux and doesn't require X11. I've pre-installed the following packages, and if people want others, just let me know (and they will be available to all projects henceforth):
vim git wget iperf dpkg-dev make m4 g++ gfortran liblzo2-dev libssl-dev libreadline-dev  libsqlite3-dev libncurses5-dev git zlib1g-dev openjdk-7-jdk libbz2-dev libfuse-dev pkg-config libattr1-dev libacl1-dev par2 ntp pandoc ssh python-lxml  calibre  ipython python-pyxattr python-pylibacl software-properties-common  libevent-dev xfsprogs lsof  tk-dev  dstat emacs vim texlive texlive-* gv imagemagick octave mercurial flex bison unzip libzmq-dev uuid-dev scilab axiom yacas octave-symbolic quota quotatool dot2tex python-numpy python-scipy python-pandas python-tables libglpk-dev python-h5py zsh python3 python3-zmq python3-setuptools cython htop ccache python-virtualenv clang libgeos-dev libgeos++-dev sloccount racket libxml2-dev libxslt-dev irssi libevent-dev tmux sysstat sbcl gawk noweb libgmp3-dev ghc  ghc-doc ghc-haddock ghc-mod ghc-prof haskell-mode haskell-doc subversion cvs bzr rcs subversion-tools git-svn markdown lua5.2 lua5.2-*  encfs auctex vim-latexsuite yatex spell cmake libpango1.0-dev xorg-dev gdb valgrind doxygen haskell-platform haskell-platform-doc haskell-platform-prof  mono-devel mono-tools-devel ocaml ocaml-doc tuareg-mode ocaml-mode libgdbm-dev mlton sshfs sparkleshare fig2ps epstool libav-tools python-software-properties software-properties-common h5utils libnetcdf-dev netcdf-doc netcdf-bin tig libtool iotop asciidoc autoconf bsdtar attr  libicu-dev iceweasel xvfb tree bindfs liblz4-tool tinc  python-scikits-learn python-scikits.statsmodels python-skimage python-skimage-doc  python-skimage-lib python-sklearn  python-sklearn-doc  python-sklearn-lib python-fuse cgroup-lite cgmanager-utils cgroup-bin libpam-cgroup cgmanager cgmanager-utils cgroup-lite  cgroup-bin r-recommended libquantlib0 libquantlib0-dev quantlib-examples quantlib-python quantlib-refman-html quantlib-ruby r-cran-rquantlib  libf2c2-dev libpng++-dev libcairomm-1.0-dev r-cran-cairodevice x11-apps mesa-utils libpangox-1.0-dev
I've also put extra effort (beyond just apt-get) to install the following:
polymake, dropbox, aldor/"AXIOM", Macaulay2, Julia, 4ti2

Functionality that is currently under development

We're working hard on improving SageMathCloud right now.

  • Streamlining document sync: will make code evaluation much faster, eliminate some serious bugs when the network is bad, etc.
  • Geographic load balancing and adding data centers, so that, e.g., if you're in Europe or Asia you can use SMC with everything happening there. This will involve DNS load balancing via Amazon Route 53, and additionally moving projects on startup to run in the DC that is nearest you, rather than in a random one. Right now all computers are in North America.
  • Mounting a folder of one project in another project, in a way that automatically fixes itself in case a machine goes down, etc. Imagine mounting the projects of all 50 students in your class, so you can easily assign and collect homework, etc.
  • Homework assignment and grading functionality with crowdsourcing of problem creation, and support for peer and manual grading.
  • BSD-licensed open source single-project version of SMC.
  • Commercial software support and instructions for how to install your own into SMC (e.g., Mathematica, Matlab, Magma, etc.)
  • Easy ssh access into projects.