Sharing and archiving data sets with Dat
Dat is a new peer-to-peer protocol that uses some of the concepts of BitTorrent and Git. Dat primarily targets researchers and open-data activists as it is a great tool for sharing, archiving, and cataloging large data sets. But it can also be used to implement decentralized web applications in a novel way.
Dat quick primer
Dat is written in JavaScript, so it can be installed with npm
, but
there are standalone binary
builds and a desktop
application (as an AppImage). An
online viewer can be used to inspect data for
those who do not want to install arbitrary binaries on their computers.
The command-line application allows basic operations like downloading existing data sets and sharing your own. Dat uses a 32-byte hex string that is an ed25519 public key, which is is used to discover and find content on the net. For example, this will download some sample data:
$ dat clone \
dat://778f8d955175c92e4ced5e4f5563f69bfec0c86cc6f670352c457943666fe639 \
~/Downloads/dat-demo
Similarly, the share
command is used to share content. It indexes the
files in a given directory and creates a new unique address like the one
above. The share
command starts a server that uses multiple discovery
mechanisms (currently, the Mainline Distributed Hash
Table (DHT), a custom DNS
server, and multicast DNS)
to announce the content to its peers. This is how another user, armed
with that public key, can download that content with dat clone
or
mirror the files continuously with dat sync
.
So far, this looks a lot like BitTorrent magnet links updated with 21st century cryptography. But Dat adds revisions on top of that, so modifications are automatically shared through the swarm. That is important for public data sets as those are often dynamic in nature. Revisions also make it possible to use Dat as a backup system by saving the data incrementally using an archiver.
While Dat is designed to work on larger data sets, processing them for sharing may take a while. For example, sharing the Linux kernel source code required about five minutes as Dat worked on indexing all of the files. This is comparable to the performance offered by IPFS and BitTorrent. Data sets with more or larger files may take quite a bit more time.
One advantage that Dat has over IPFS is that it doesn't duplicate the
data. When IPFS imports new data, it duplicates the files into
~/.ipfs
. For collections of small files like the kernel, this is not a
huge problem, but for larger files like videos or music, it's a
significant limitation. IPFS eventually implemented a solution to this
problem in the form of the
experimental filestore
feature,
but it's not enabled by default. Even with that feature enabled,
though, changes to data sets are not automatically tracked. In
comparison, Dat operation on dynamic data feels much lighter. The
downside is that each set needs its own dat share
process.
Like any peer-to-peer system, Dat needs at least one peer to stay online to offer the content, which is impractical for mobile devices. Hosting providers like Hashbase (which is a pinning service in Dat jargon) can help users keep content online without running their own server. The closest parallel in the traditional web ecosystem would probably be content distribution networks (CDN) although pinning services are not necessarily geographically distributed and a CDN does not necessarily retain a complete copy of a website.
A web browser called Beaker, based on the
Electron framework, can access Dat content
natively without going through a pinning service. Furthermore, Beaker is
essential to get any of the Dat
applications working, as they
fundamentally rely on dat://
URLs to do their magic. This means that
Dat applications won't work for most users unless they install that
special web browser. There is a Firefox
extension
called "dat-fox" for people
who don't want to install yet another browser, but it requires
installing a helper
program. The extension
will be able to load dat://
URLs but many applications will still not
work. For example, the photo gallery
application completely
fails with dat-fox.
Dat-based applications look promising from a privacy point of view. Because of its peer-to-peer nature, users regain control over where their data is stored: either on their own computer, an online server, or by a trusted third party. But considering the protocol is not well established in current web browsers, I foresee difficulties in adoption of that aspect of the Dat ecosystem. Beyond that, it is rather disappointing that Dat applications cannot run natively in a web browser given that JavaScript is designed exactly for that.
Dat privacy
An advantage Dat has over other peer-to-peer protocols like BitTorrent is end-to-end encryption. I was originally concerned by the encryption design when reading the academic paper [PDF]:
It is up to client programs to make design decisions around which discovery networks they trust. For example if a Dat client decides to use the BitTorrent DHT to discover peers, and they are searching for a publicly shared Dat key (e.g. a key cited publicly in a published scientific paper) with known contents, then because of the privacy design of the BitTorrent DHT it becomes public knowledge what key that client is searching for.
So in other words, to share a secret file with another user, the public key is transmitted over a secure side-channel, only to then leak during the discovery process. Fortunately, the public Dat key is not directly used during discovery as it is hashed with BLAKE2B. Still, the security model of Dat assumes the public key is private, which is a rather counterintuitive concept that might upset cryptographers and confuse users who are frequently encouraged to type such strings in address bars and search engines as part of the Dat experience. There is a security & privacy FAQ in the Dat documentation warning about this problem:
One of the key elements of Dat privacy is that the public key is never used in any discovery network. The public key is hashed, creating the discovery key. Whenever peers attempt to connect to each other, they use the discovery key.
Data is encrypted using the public key, so it is important that this key stays secure.
There are other privacy issues outlined in the document; it states that "Dat faces similar privacy risks as BitTorrent":
When you download a dataset, your IP address is exposed to the users sharing that dataset. This may lead to honeypot servers collecting IP addresses, as we've seen in Bittorrent. However, with dataset sharing we can create a web of trust model where specific institutions are trusted as primary sources for datasets, diminishing the sharing of IP addresses.
A Dat blog post refers to this issue as reader privacy and it is, indeed, a sensitive issue in peer-to-peer networks. It is how BitTorrent users are discovered and served scary verbiage from lawyers, after all. But Dat makes this a little better because, to join a swarm, you must know what you are looking for already, which means peers who can look at swarm activity only include users who know the secret public key. This works well for secret content, but for larger, public data sets, it is a real problem; it is why the Dat project has avoided creating a Wikipedia mirror so far.
I found another privacy issue that is not documented in the security FAQ during my review of the protocol. As mentioned earlier, the Dat discovery protocol routinely phones home to DNS servers operated by the Dat project. This implies that the default discovery servers (and an attacker watching over their traffic) know who is publishing or seeking content, in essence discovering the "social network" behind Dat. This discovery mechanism can be disabled in clients, but a similar privacy issue applies to the DHT as well, although that is distributed so it doesn't require trust of the Dat project itself.
Considering those aspects of the protocol, privacy-conscious users will probably want to use Tor or other anonymization techniques to work around those concerns.
The future of Dat
Dat 2.0 was released in June 2017 with performance improvements and protocol changes. Dat Enhancement Proposals (DEPs) guide the project's future development; most work is currently geared toward implementing the draft "multi-writer proposal" in HyperDB. Without multi-writer support, only the original publisher of a Dat can modify it. According to Joe Hand, co-executive-director of Code for Science & Society (CSS) and Dat core developer, in an IRC chat, "supporting multiwriter is a big requirement for lots of folks". For example, while Dat might allow Alice to share her research results with Bob, he cannot modify or contribute back to those results. The multi-writer extension allows for Alice to assign trust to Bob so he can have write access to the data.
Unfortunately, the current proposal doesn't solve the "hard problems" of "conflict merges and secure key distribution". The former will be worked out through user interface tweaks, but the latter is a classic problem that security projects have typically trouble finding solutions for---Dat is no exception. How will Alice securely trust Bob? The OpenPGP web of trust? Hexadecimal fingerprints read over the phone? Dat doesn't provide a magic solution to this problem.
Another thing limiting adoption is that Dat is not packaged in any distribution that I could find (although I requested it in Debian) and, considering the speed of change of the JavaScript ecosystem, this is unlikely to change any time soon. A Rust implementation of the Dat protocol has started, however, which might be easier to package than the multitude of Node.js modules. In terms of mobile device support, there is an experimental Android web browser with Dat support called Bunsen, which somehow doesn't run on my phone. Some adventurous users have successfully run Dat in Termux. I haven't found an app running on iOS at this point.
Even beyond platform support, distributed protocols like Dat have a tough slope to climb against the virtual monopoly of more centralized protocols, so it remains to be seen how popular those tools will be. Hand says Dat is supported by multiple non-profit organizations. Beyond CSS, Blue Link Labs is working on the Beaker Browser as a self-funded startup and a grass-roots organization, Digital Democracy, has contributed to the project. The Internet Archive has announced a collaboration between itself, CSS, and the California Digital Library to launch a pilot project to see "how members of a cooperative, decentralized network can leverage shared services to ensure data preservation while reducing storage costs and increasing replication counts".
Hand said adoption in academia has been "slow but steady" and that the Dat in the Lab project has helped identify areas that could help researchers adopt the project. Unfortunately, as is the case with many free-software projects, he said that "our team is definitely a bit limited on bandwidth to push for bigger adoption". Hand said that the project received a grant from Mozilla Open Source Support to improve its documentation, which will be a big help.
Ultimately, Dat suffers from a problem common to all peer-to-peer
applications, which is naming. Dat addresses are not exactly intuitive:
humans do not remember strings of 64 hexadecimal characters well. For
this, Dat took a similar
approach
to IPFS by using DNS TXT
records and /.well-known
URL paths to
bridge existing, human-readable names with Dat hashes. So this
sacrifices a part of the decentralized nature of the project in favor of
usability.
I have tested a lot of distributed protocols like Dat in the past and I am not sure Dat is a clear winner. It certainly has advantages over IPFS in terms of usability and resource usage, but the lack of packages on most platforms is a big limit to adoption for most people. This means it will be difficult to share content with my friends and family with Dat anytime soon, which would probably be my primary use case for the project. Until the protocol reaches the wider adoption that BitTorrent has seen in terms of platform support, I will probably wait before switching everything over to this promising project.
This article first appeared in the Linux Weekly News.