==================
DAWG documentation
==================

This package provides DAWG (DAFSA_)-based dictionary-like read-only objects for
Python (2.x and 3.x).

String data in a DAWG may take 200x less memory than in a standard Python dict
while the raw lookup speed is comparable; it also provides fast advanced methods
like prefix search.

Based on the `dawgdic`_ C++ library.

.. _dawgdic: https://code.google.com/p/dawgdic/
.. _libb64: http://libb64.sourceforge.net/
.. _DAFSA: https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton

License
=======

Wrapper code is licensed under the MIT License.
The bundled `dawgdic`_ C++ library is licensed under the BSD license.
The bundled libb64_ library is in the Public Domain.

Installation
============

From `PyPI <https://pypi.org/project/DAWG/>`_::

    pip install DAWG

Usage
=====

There are several DAWG classes in this package:

* ``dawg.DAWG`` - basic DAWG wrapper; it can store unicode keys and do exact lookups;
* ``dawg.CompletionDAWG`` - ``dawg.DAWG`` subclass that supports key completion and prefix lookups (but requires more memory);
* ``dawg.BytesDAWG`` - ``dawg.CompletionDAWG`` subclass that maps unicode keys to lists of ``bytes`` objects;
* ``dawg.RecordDAWG`` - ``dawg.BytesDAWG`` subclass that maps unicode keys to lists of data tuples. All tuples must be of the same format (the data is packed using the Python ``struct`` module);
* ``dawg.IntDAWG`` - ``dawg.DAWG`` subclass that maps unicode keys to integer values;
* ``dawg.IntCompletionDAWG`` - ``dawg.CompletionDAWG`` subclass that maps unicode keys to integer values.

DAWG and CompletionDAWG
-----------------------

``DAWG`` and ``CompletionDAWG`` are useful when you need fast & memory-efficient
simple string storage. These classes do not support assigning values to keys.

``DAWG`` and ``CompletionDAWG`` constructors accept an iterable with keys::

    >>> import dawg
    >>> words = [u'foo', u'bar', u'foobar', u'foö', u'bör']
    >>> base_dawg = dawg.DAWG(words)
    >>> completion_dawg = dawg.CompletionDAWG(words)

It is then possible to check if a key is in a DAWG::

    >>> u'foo' in base_dawg
    True
    >>> u'baz' in completion_dawg
    False

It is possible to find all keys that start with a given prefix in a
``CompletionDAWG``::

    >>> completion_dawg.keys(u'foo')
    [u'foo', u'foobar']

to test whether some key begins with a given prefix::

    >>> completion_dawg.has_keys_with_prefix(u'foo')
    True

and to find all prefixes of a given key::

    >>> base_dawg.prefixes(u'foobarz')
    [u'foo', u'foobar']

Iterator versions are also available::

    >>> for key in completion_dawg.iterkeys(u'foo'):
    ...     print(key)
    foo
    foobar
    >>> for prefix in base_dawg.iterprefixes(u'foobarz'):
    ...     print(prefix)
    foo
    foobar

It is possible to find all keys similar to a given key (using a one-way char
translation table)::

    >>> replaces = dawg.DAWG.compile_replaces({u'o': u'ö'})
    >>> base_dawg.similar_keys(u'foo', replaces)
    [u'foo', u'foö']
    >>> base_dawg.similar_keys(u'foö', replaces)
    [u'foö']
    >>> base_dawg.similar_keys(u'bor', replaces)
    [u'bör']

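Because ``iterkeys`` returns an iterator, it can also be combined with standard
library helpers such as ``itertools.islice`` when only the first few completions
for a prefix are needed (a usage sketch, not a dedicated library feature)::

    >>> import itertools
    >>> list(itertools.islice(completion_dawg.iterkeys(u'fo'), 2))
    [u'foo', u'foobar']
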
BytesDAWG
---------

``BytesDAWG`` is a ``CompletionDAWG`` subclass that can store binary data
for each key.

``BytesDAWG`` constructor accepts an iterable with
``(unicode_key, bytes_value)`` tuples::

    >>> data = [(u'key1', b'value1'), (u'key2', b'value2'), (u'key1', b'value3')]
    >>> bytes_dawg = dawg.BytesDAWG(data)

There can be duplicate keys; all unique values are stored in this case::

    >>> bytes_dawg[u'key1']
    [b'value1', b'value3']

For unique keys a list with a single value is returned for consistency::

    >>> bytes_dawg[u'key2']
    [b'value2']

``KeyError`` is raised for missing keys; use the ``get`` method if you need a
default value instead::

    >>> bytes_dawg.get(u'foo', None)
    None

``BytesDAWG`` supports ``keys``, ``items``, ``iterkeys`` and ``iteritems``
methods (they all accept an optional key prefix). There is also support for
``similar_keys``, ``similar_items`` and ``similar_item_values`` methods.

RecordDAWG
----------

``RecordDAWG`` is a ``BytesDAWG`` subclass that automatically packs & unpacks
the binary data from/to Python objects using the ``struct`` module from the
standard library.

First, you have to define a format for the data. Consult the Python docs
(http://docs.python.org/library/struct.html#format-strings) for the format
string specification.

For example, let's store 3 short unsigned numbers (in a Big-Endian byte order)
as values::

    >>> format = ">HHH"

``RecordDAWG`` constructor accepts an iterable with
``(unicode_key, value_tuple)`` tuples. Let's create such an iterable using the
``zip`` function::

    >>> keys = [u'foo', u'bar', u'foobar', u'foo']
    >>> values = [(1, 2, 3), (2, 1, 0), (3, 3, 3), (2, 1, 5)]
    >>> data = zip(keys, values)
    >>> record_dawg = dawg.RecordDAWG(format, data)

As with ``BytesDAWG``, there can be several values for the same key::

    >>> record_dawg[u'foo']
    [(1, 2, 3), (2, 1, 5)]
    >>> record_dawg[u'foobar']
    [(3, 3, 3)]

BytesDAWG and RecordDAWG implementation details
-----------------------------------------------

``BytesDAWG`` and ``RecordDAWG`` store data at the end of the keys::

    <utf8-encoded key><separator><base64-encoded data>

Data is encoded to base64 because the dawgdic_ C++ library doesn't allow zero
bytes in keys (it uses null-terminated strings), and such bytes are very likely
in binary data.

In DAWG versions prior to 0.5 ``<separator>`` was the ``chr(255)`` byte. It was
chosen because keys are stored as UTF8-encoded strings and ``chr(255)`` is
guaranteed not to appear in valid UTF8, so the end of the text part of the key
is not ambiguous.

But ``chr(255)`` proved to be problematic: it changes the order of the keys.
Keys are naturally returned in lexicographical order by DAWG. But if
``chr(255)`` appears at the end of each text part of a key then the visible
order changes. Imagine a ``'foo'`` key with some payload and a ``'foobar'`` key
with some payload. The ``'foo'`` key would be greater than the ``'foobar'``
key: the values compared would be ``'foo<separator>...'`` and
``'foobar<separator>...'``, and ``ord(<separator>) == 255`` is greater than
``ord('b')``.

So now the default ``<separator>`` is ``chr(1)``. This is the lowest allowed
character and so it preserves the alphabetical order.

It is not strictly correct to use ``chr(1)`` as a separator because ``chr(1)``
is a valid UTF8 character. But in practice this is unlikely to be an issue:
such a control character is very unlikely in text keys, and binary keys are not
supported anyway because dawgdic_ doesn't support keys containing ``chr(0)``.

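For illustration only, here is roughly how a single ``(key, value)`` pair would
be laid out under the scheme described above, using the default ``chr(1)``
separator (the ``stored_key`` helper is hypothetical and only mimics the
documented layout; the library does the actual encoding internally with the
bundled libb64_)::

    >>> import base64
    >>> def stored_key(key, value, separator=b'\x01'):
    ...     return key.encode('utf8') + separator + base64.b64encode(value)
    >>> stored_key(u'key1', b'value1')
    b'key1\x01dmFsdWUx'
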
If you can't guarantee that ``chr(1)`` is not a part of your keys, if
lexicographical order is not important to you, or if there is a need to read a
``BytesDAWG``/``RecordDAWG`` created by DAWG < 0.5, then pass the
``payload_separator`` argument to the constructor::

    >>> BytesDAWG(payload_separator=b'\xff').load('old.dawg')

The storage scheme has one more implication: values of ``BytesDAWG`` and
``RecordDAWG`` are also sorted lexicographically.

For ``RecordDAWG`` there is a gotcha: in order to have a meaningful ordering of
numeric values, store them in big-endian format::

    >>> data = [('foo', (3, 2, 256)), ('foo', (3, 2, 1)), ('foo', (3, 2, 3))]
    >>> d = RecordDAWG("3H", data)
    >>> d.items()
    [(u'foo', (3, 2, 256)), (u'foo', (3, 2, 1)), (u'foo', (3, 2, 3))]

    >>> d2 = RecordDAWG(">3H", data)
    >>> d2.items()
    [(u'foo', (3, 2, 1)), (u'foo', (3, 2, 3)), (u'foo', (3, 2, 256))]

IntDAWG and IntCompletionDAWG
-----------------------------

``IntDAWG`` is a ``{unicode -> int}`` mapping. It is possible to use
``RecordDAWG`` for this, but ``IntDAWG`` is natively supported by the dawgdic_
C++ library and so its ``__getitem__`` is much faster.

Unlike ``BytesDAWG`` and ``RecordDAWG``, ``IntDAWG`` doesn't support having
several values for the same key.

``IntDAWG`` constructor accepts an iterable with
``(unicode_key, integer_value)`` tuples::

    >>> data = [ (u'foo', 1), (u'bar', 2) ]
    >>> int_dawg = dawg.IntDAWG(data)

It is then possible to get a value from the IntDAWG::

    >>> int_dawg[u'foo']
    1

``IntCompletionDAWG`` supports all ``IntDAWG`` and ``CompletionDAWG`` methods,
plus ``.items()`` and ``.iteritems()``.

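For example (a small sketch; it assumes ``IntCompletionDAWG`` accepts the same
``(unicode_key, integer_value)`` pairs as ``IntDAWG``)::

    >>> int_completion_dawg = dawg.IntCompletionDAWG([(u'foo', 1), (u'foobar', 2), (u'bar', 3)])
    >>> int_completion_dawg.keys(u'foo')
    [u'foo', u'foobar']
    >>> int_completion_dawg.items()
    [(u'bar', 3), (u'foo', 1), (u'foobar', 2)]
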
Persistence
-----------

All DAWGs support saving/loading and pickling/unpickling.

Write DAWG to a stream::

    >>> with open('words.dawg', 'wb') as f:
    ...     d.write(f)

Save DAWG to a file::

    >>> d.save('words.dawg')

Load DAWG from a file::

    >>> d = dawg.DAWG()
    >>> d.load('words.dawg')

.. warning::

    Reading DAWGs from streams and unpickling currently use 3x memory compared
    to loading DAWGs with the ``load`` method; please avoid them until the
    issue is fixed.

Read DAWG from a stream::

    >>> d = dawg.RecordDAWG(format_string)
    >>> with open('words.record-dawg', 'rb') as f:
    ...     d.read(f)

DAWG objects are picklable::

    >>> import pickle
    >>> data = pickle.dumps(d)
    >>> d2 = pickle.loads(data)

Benchmarks
==========

For a list of 3000000 (3 million) Russian words, memory consumption with
different data structures (under Python 2.7):

* dict(unicode words -> word lengths): about 600M
* list(unicode words): about 300M
* ``marisa_trie.RecordTrie``: 11M
* ``marisa_trie.Trie``: 7M
* ``dawg.DAWG``: 2M
* ``dawg.CompletionDAWG``: 3M
* ``dawg.IntDAWG``: 2.7M
* ``dawg.RecordDAWG``: 4M

.. note::

    Lengths of words were not stored as values in ``dawg.DAWG``,
    ``dawg.CompletionDAWG`` and ``marisa_trie.Trie`` because they don't
    support this.

.. note::

    `marisa-trie`_ is often more memory efficient than DAWG (depending on the
    data); it can also handle larger datasets and provides memory-mapped IO,
    so don't dismiss `marisa-trie`_ based on this README file. It is still
    several times slower than DAWG though.

.. _marisa-trie: https://github.com/pytries/marisa-trie

Benchmark results (100k unicode words, integer values (lengths of the words),
Python 3.3, MacBook Air i5 1.8 GHz)::

    dict __getitem__ (hits)            7.300M ops/sec
    DAWG __getitem__ (hits)            not supported
    BytesDAWG __getitem__ (hits)       1.230M ops/sec
    RecordDAWG __getitem__ (hits)      0.792M ops/sec
    IntDAWG __getitem__ (hits)         4.217M ops/sec
    dict get() (hits)                  3.775M ops/sec
    DAWG get() (hits)                  not supported
    BytesDAWG get() (hits)             1.027M ops/sec
    RecordDAWG get() (hits)            0.733M ops/sec
    IntDAWG get() (hits)               3.162M ops/sec
    dict get() (misses)                4.533M ops/sec
    DAWG get() (misses)                not supported
    BytesDAWG get() (misses)           3.545M ops/sec
    RecordDAWG get() (misses)          3.485M ops/sec
    IntDAWG get() (misses)             3.928M ops/sec

    dict __contains__ (hits)           7.090M ops/sec
    DAWG __contains__ (hits)           4.685M ops/sec
    BytesDAWG __contains__ (hits)      3.885M ops/sec
    RecordDAWG __contains__ (hits)     3.898M ops/sec
    IntDAWG __contains__ (hits)        4.612M ops/sec

    dict __contains__ (misses)         5.617M ops/sec
    DAWG __contains__ (misses)         6.204M ops/sec
    BytesDAWG __contains__ (misses)    6.026M ops/sec
    RecordDAWG __contains__ (misses)   6.007M ops/sec
    IntDAWG __contains__ (misses)      6.180M ops/sec

    DAWG.similar_keys (no replaces)    0.492M ops/sec
    DAWG.similar_keys (l33t)           0.413M ops/sec

    dict items()                       55.032 ops/sec
    DAWG items()                       not supported
    BytesDAWG items()                  14.826 ops/sec
    RecordDAWG items()                 9.436 ops/sec
    IntDAWG items()                    not supported

    dict keys()                        200.788 ops/sec
    DAWG keys()                        not supported
    BytesDAWG keys()                   20.657 ops/sec
    RecordDAWG keys()                  20.873 ops/sec
    IntDAWG keys()                     not supported

    DAWG.prefixes (hits)               1.552M ops/sec
    DAWG.prefixes (mixed)              4.342M ops/sec
    DAWG.prefixes (misses)             4.094M ops/sec
    DAWG.iterprefixes (hits)           0.391M ops/sec
    DAWG.iterprefixes (mixed)          0.476M ops/sec
    DAWG.iterprefixes (misses)         0.498M ops/sec

    RecordDAWG.keys(prefix="xxx"), avg_len(res)==415        5.562K ops/sec
    RecordDAWG.keys(prefix="xxxxx"), avg_len(res)==17       104.011K ops/sec
    RecordDAWG.keys(prefix="xxxxxxxx"), avg_len(res)==3     318.129K ops/sec
    RecordDAWG.keys(prefix="xxxxx..xx"), avg_len(res)==1.4  462.238K ops/sec
    RecordDAWG.keys(prefix="xxx"), NON_EXISTING             4292.625K ops/sec

Please take these benchmark results with a grain of salt; this is a very simple
benchmark on a single data set.

Current limitations
===================

* ``IntDAWG`` is currently a subclass of ``DAWG``, so it doesn't support
  ``keys()`` and ``items()`` methods;
* the ``read()`` method reads the whole stream (so the DAWG must be the last or
  the only item in a stream if it is read with the ``read()`` method) -
  pickling doesn't have this limitation;
* DAWGs loaded with ``read()`` and unpickled DAWGs use 3x-4x memory compared to
  DAWGs loaded with the ``load()`` method;
* there are ``keys()`` and ``items()`` methods but no ``values()`` method;
* iterator versions of methods are not always implemented;
* ``BytesDAWG`` and ``RecordDAWG`` have a limitation: values larger than 8KB
  are unsupported;
* the maximum number of DAWG units is limited: the number of DAWG units (and
  thus transitions - but not elements) should be less than 2^29; this means
  that it may be impossible to build an especially huge DAWG (you may split
  your data into several DAWGs, as in the sketch below, or try `marisa-trie`_
  in this case).

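One possible workaround for the unit limit is to shard the keys across several
smaller DAWGs. A minimal sketch of that idea, sharding by the first character
so that a prefix query only has to touch a single shard (the ``build_sharded``
and ``sharded_keys`` helpers are illustrative, not part of this package)::

    import dawg
    from collections import defaultdict

    def build_sharded(words):
        """Group keys by their first character; build one CompletionDAWG per group."""
        buckets = defaultdict(list)
        for word in words:
            buckets[word[:1]].append(word)
        return {first: dawg.CompletionDAWG(group) for first, group in buckets.items()}

    def sharded_keys(shards, prefix):
        """Query only the shard that can contain the prefix (non-empty prefix assumed)."""
        shard = shards.get(prefix[:1])
        return shard.keys(prefix) if shard is not None else []

    shards = build_sharded([u'foo', u'bar', u'foobar', u'baz'])
    print(sharded_keys(shards, u'foo'))   # ['foo', 'foobar']
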
Contributions are welcome!

Contributing
============

Development happens at github: https://github.com/pytries/DAWG

Issue tracker: https://github.com/pytries/DAWG/issues

Feel free to submit ideas, bugs or pull requests. If you find a bug in the C++
part, please report it to the original `dawgdic`_ project's bug tracker.

How is the source code organized
--------------------------------

There are 4 folders in the repository:

* ``bench`` - benchmarks & benchmark data;
* ``lib`` - the original unmodified `dawgdic`_ C++ library and a customized
  version of the `libb64`_ library. They are bundled for easier distribution;
  if something has to be fixed in these libraries, consider fixing it in the
  original repositories;
* ``src`` - wrapper code; ``src/dawg.pyx`` is the wrapper implementation;
  ``src/*.pxd`` files are Cython_ headers for the corresponding C++ headers;
  ``src/*.cpp`` files are the pre-built extension code and shouldn't be
  modified directly (they should be updated via the ``update_cpp.sh`` script);
* ``tests`` - the test suite.

Running tests and benchmarks
----------------------------

Make sure `tox`_ is installed and run

::

    $ tox

from the source checkout. Tests should pass under Python 2.6, 2.7, 3.2, 3.3
and 3.4.

In order to run benchmarks, type

::

    $ tox -c bench.ini

.. _cython: http://cython.org
.. _tox: http://tox.testrun.org

.. include:: ../AUTHORS.rst
.. include:: ../CHANGES.rst