• 1 Post
  • 31 Comments
Joined 1 year ago
cake
Cake day: June 14th, 2023

help-circle








  • Your code looked alright. Working in C is a risky chore. You’re early in your journey and I think it’s good to get a taste of many of the traditional techniques before turning towards newer fancier algorithms.

    “Haskell and OCaml”, hm? The two concepts you need to internalize are katamorphisms and paramorphisms. You may have heard of “recursion schemes”; these two schemes are the ones available on any tree. Fundamentally, most of a compiler is tree-to-tree transformations (nanopasses), and they are expressible as one of two forms:

    • Katamorphism: The leaves of each node are transformed before the branch (bottom-up)
    • Paramorphism: The leaves are transformed after/during the branch transformation (top-down)

    If you look again at my AST builder builder, you’ll see .run() and .walk() methods, implementing the recursion for any katamorphism or paramorphism respectively. In Haskell, these are called Traversable types. This is a pun in Monte, where .run() is also the default method; the syntax makes it easy to perform a katamorphism by passing a tree-traversing object directly to the AST.

    Your types are alright, but you’ll want to pass a generic parameter through them, turning them into a valid Functor in Haskell or a generic module in OCaml. This is a basic defense against the “AST Decoration Problem”, AKA the “AST Typing Problem”. As you add features to your compiler, you’ll realize why these are necessary.





  • I copied and pasted from the terminal to ensure that I formatted the error message properly. The question-mark prompt is what E used, or at least E-on-Java. Monte used a little Unicode mountain:

    ⛰  currentProcess.getProcessID() :Int
    Result: 2805098
    ⛰  def x :Int := "42"
    Exception: "42" does not conform to Int
    ⛰  "42" :Int
    Exception: "42" does not conform to Int
    

    I can’t really give a reason other than that the prompt characters on Unix-like systems are arbitrary and most REPL libraries allow them to be customized.


  • [HTML and Markdown] are not grammatically Type 2 (Chomsky-wise, Context-Free); rather, they are Type 3 (Chomsky-wise, Regular).

    This is at least half-wrong, in that HTML is clearly not regular. The proof is simple: HTML structurally embeds a language of balanced parentheses (a Dyck language), and such languages are context-free and not regular. I don’t know Markdown well and there are several flavors; it’s quite possible that some flavors are regular. However, traditional Markdown embeds HTML, so if HTML is not regular than neither is Markdown.

    I once did a syntax-directed translation of Markdown to HTML in AWK!

    Sure. The original Markdown implementation was in Perl and operated similarly. However, note that this doesn’t imply that either language is regular, only that a translation is possible assuming the input is valid Markdown. Punting on recognition means that invalid parse trees have undefined translations, presumably at least sometimes generating invalid HTML.


  • There are Python compilers which do AST analysis instead of bytecode analysis, particularly Nuitka and Shed Skin. They aren’t very good, but it’s not clear whether that’s because working with the AST is somehow harder than working with the bytecode. RPython doesn’t compile all bytecodes; most generator/coroutine functionality is missing, for example.

    Think of type-checking as a syntactic analysis; this is how it avoids Rice’s theorem. Like you say, we can annotate names with type information, and we can do it without evaluating the code. The main problem here is that Python’s semantics don’t require these annotations to enforce the types of values; you may be interested in E, a research language from the 90s which did enforce type annotations on otherwise-untyped names. In Python, this doesn’t error:

    >>> x :int = "42"
    

    But in E, this does error:

    ? def x :int := "42"
    # problem: <ClassCastException: String doesn't coerce to an int>
    

    Sadly, E is long dead, and something of an archeological artifact rather than a usable system. But it may be inspiring to your future efforts, especially since it sounds like you’re learning how to build compilers. (I helped write Monte, a language which blends E and Python; it is also dead, but was more enjoyable than E.)


  • I’ve only skimmed the paper, so let me know if I’ve missed something, ideally with a page number. Also, it’s late and I’m tired, so I’m not hyperlinking anything; sorry.

    I’m not sure what a “full semantic analysis” entails, but always keep Rice’s theorem in mind: there aren’t any interesting semantic analyses available for Turing-complete systems.

    Python is a descendant of Smalltalk. Like several of its cousins, particularly the famous ECMAScript, Python doesn’t have types or classes in the Smalltalk sense, but prototypes which form a class-like hierarchy. From the static-analysis point of view, whether a type is created or instantiated is a matter of Rice’s theorem.

    The ability to invoke type() at runtime is not lazy. Python is eager and strict; even generators are eager and strict, although they can cause stack frames to become “stale”; whether a stale stack frame is cleaned up is also a matter of Rice’s theorem.

    None of this prevents compilation of Python. The RPython toolchain first imports an application, evaluating all calls to type() and pre-building all classes; then, it statically analyzes all of the Python objects in memory and decompiles their bytecode to determine their behaviors. The resulting executable behaves as if it were started from a snapshot of the Python heap.

    Yes, CPython sucks. Use PyPy instead; also, use cffi to wrap C libraries.




  • It would be incorrect, but only because you’ve learned from vague teaching materials. You’re on the right track; let’s just be a bit more precise about terminology.

    An algorithm is a sequence of numbered steps, where some steps make decisions and some steps point towards other steps. Equivalently (thanks to the structured-programming theorem) an algorithm is a flowchart with decision nodes. Machines typically evaluate algorithms directly; the numbered steps can be mapped to sequences of machine instructions. Machine states arise from considering how a particular algorithm behaves on a given machine.

    A specification is a pair of claims, often called the precondition and postcondition, which logically connect the inputs and outputs of a machine. The idea is that, if we fulfill the precondition prior to running the machine, then the outputs of the machine will fulfill the postcondition. The study of specifications is often called “formal methods”, a terrible name for a great idea.

    The connection is hopefully straightforward to see: an algorithm might happen to fulfill a specification. In general, there are often multiple algorithms for a given specification; for example, there’s usually an infinite number of ways to add a do-nothing instruction to an algorithm without changing any of its behaviors. A classic example is sorting; there are many sorting algorithms, all of which fulfill the postcondition that the output list is sorted, usually with the precondition that the input list contains sortable/orderable elements.