ijar: A tool for generating interface .jars from normal .jars
=============================================================

Alan Donovan, 26 May 2007.

Rationale:

In order to improve the speed of compilation of Java programs in
Bazel, the output of build steps is cached.

This works very nicely for C++ compilation: a compilation unit
includes a .cc source file and typically dozens of header files.
Header files change relatively infrequently, so the need for a
rebuild is usually driven by a change in the .cc file. Even after
syncing a slightly newer version of the tree and doing a rebuild,
many hits in the cache are still observed.

In Java, by contrast, a compilation unit involves a set of .java
source files, plus a set of .jar files containing already-compiled
JVM .class files. Class files serve a dual purpose: from the JVM's
perspective, they are containers of executable code, but from the
compiler's perspective, they are interface definitions. The problem
here is that .jar files are far more sensitive to change than
C++ header files, so even a change that is insignificant to the
compiler (such as the addition of a print statement to a method in a
prerequisite class) will cause the jar to change, and any code that
depends on this jar's interface will be recompiled unnecessarily.

The purpose of ijar is to produce, from a .jar file, a much smaller,
simpler .jar file containing only the parts that are significant for
the purposes of compilation: in other words, an interface .jar
file. By changing one's compilation dependencies to be the interface
jar files, unnecessary recompilation is avoided when upstream
changes don't affect the interface.

Details:

ijar is a tool that reads a .jar file and emits a .jar file
containing only the parts that are relevant to Java compilation.
For example, it throws away:

- Files whose names do not end in ".class".
- All executable method code.
- All private methods and fields.
- All constants and attributes except the minimal set necessary to
  describe the class interface.
- All debugging information
  (LineNumberTable, SourceFile, LocalVariableTable attributes).

It also sets to zero the file modification times in the index of the
.jar file.
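
As an illustration of the first rule in the list above, here is a minimal
sketch of a filename filter. The function name IsClassFile is hypothetical
and is not taken from the ijar sources; it only shows the kind of suffix
test that "does not end in .class" implies.

```cpp
#include <string>

// Hypothetical helper (not from the ijar sources): returns true if a jar
// entry name ends in ".class", i.e. the entry would be kept rather than
// thrown away under the first rule above.
bool IsClassFile(const std::string& name) {
  static const std::string kSuffix = ".class";
  return name.size() >= kSuffix.size() &&
         name.compare(name.size() - kSuffix.size(), kSuffix.size(),
                      kSuffix) == 0;
}
```

Under this sketch, an entry such as META-INF/MANIFEST.MF would be dropped,
while com/example/Foo.class would be passed on for stripping.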

Implementation:

ijar is implemented in C++, and runs very quickly. For example
(when optimized) it takes only 530ms to process a 42MB
.jar file containing 5878 classes, resulting in an interface .jar
file of only 11.4MB in size. For more usual .jar sizes of a few
megabytes, a runtime of 50ms is typical.

The implementation strategy is to mmap both the input jar and the
newly-created _interface.jar, and to scan through the former and
emit the latter in a single pass. There are a couple of locations
where some kind of "backpatching" is required:

- in the .zip file format, for each file, the size field precedes
  the data. We emit a zero but note its location, generate and emit
  the stripped classfile, then poke the correct size into the
  location.

- for JVM .class files, the header (including the constant table)
  precedes the body, but cannot be emitted before it because it's
  not until we emit the body that we know which constants are
  referenced and which are garbage. So we emit the body into a
  temporary buffer, then emit the header to the output jar, followed
  by the contents of the temp buffer.
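
The size backpatching described in the first bullet can be sketched roughly
as follows. This is an illustration only: it writes into a std::vector
instead of an mmap'ed output file, and the names PokeU4LE and
EmitWithBackpatchedSize are hypothetical, not taken from the ijar sources.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: overwrite 4 bytes at `pos` with `v` in little-endian
// order (zip stores its multi-byte fields little-endian).
void PokeU4LE(std::vector<uint8_t>& out, size_t pos, uint32_t v) {
  for (int i = 0; i < 4; ++i) {
    out[pos + i] = static_cast<uint8_t>(v >> (8 * i));
  }
}

// Emits a zero size placeholder, then the payload, then pokes the real size
// back into the placeholder. Returns the offset of the size field.
size_t EmitWithBackpatchedSize(std::vector<uint8_t>& out,
                               const std::vector<uint8_t>& payload) {
  const size_t size_pos = out.size();
  out.resize(out.size() + 4, 0);  // placeholder: size not yet known
  const size_t data_start = out.size();
  out.insert(out.end(), payload.begin(), payload.end());
  PokeU4LE(out, size_pos, static_cast<uint32_t>(out.size() - data_start));
  return size_pos;
}
```

In the real tool the payload is generated while it is being emitted, which
is why the size cannot simply be written up front; the sketch takes a
ready-made payload only to stay short.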

Also note that the zip file format has unnecessary duplication of
the index metadata: it has header+data for each file, then another
set of (similar) headers at the end. Rather than save the metadata
explicitly in some data structure, we just record the addresses of
the already-emitted zip metadata entries in the output file, and
then read from there as necessary.
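
A rough sketch of "read from there as necessary": given the recorded offset
of an already-emitted local file header, the fields needed for the trailing
headers can be read back out of the output buffer. The field offsets follow
the standard zip local file header layout; the names EntryInfo,
ReadLocalHeader, ReadU2LE and ReadU4LE are hypothetical, and a std::vector
stands in for the mmap'ed output.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Little-endian readers (zip fields are little-endian).
uint16_t ReadU2LE(const uint8_t* p) {
  return static_cast<uint16_t>(p[0] | (p[1] << 8));
}
uint32_t ReadU4LE(const uint8_t* p) {
  return static_cast<uint32_t>(p[0]) | (static_cast<uint32_t>(p[1]) << 8) |
         (static_cast<uint32_t>(p[2]) << 16) |
         (static_cast<uint32_t>(p[3]) << 24);
}

// Hypothetical record of what gets read back from an emitted header.
struct EntryInfo {
  std::string name;
  uint32_t uncompressed_size;
};

// Re-reads the local file header that was emitted at `offset` in `out`.
// Layout: uncompressed size at byte 22, name length at byte 26, and the
// file name itself starting at byte 30.
EntryInfo ReadLocalHeader(const std::vector<uint8_t>& out, size_t offset) {
  const uint8_t* h = out.data() + offset;
  const uint32_t size = ReadU4LE(h + 22);
  const uint16_t name_len = ReadU2LE(h + 26);
  return {std::string(reinterpret_cast<const char*>(h + 30), name_len), size};
}
```

With this approach the only bookkeeping needed during the single pass is a
list of offsets, one per emitted entry.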

Notes:

This code has no dependencies other than the STL and zlib.

Almost all of the getX/putX/ReadX/WriteX functions in the code
advance their first argument pointer, which is passed by reference.
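
The pointer-advancing convention can be illustrated with a pair of 16-bit
big-endian accessors (JVM class files are big-endian). The names GetU2BE and
PutU2BE are chosen for this sketch and may not match the actual function
names in the code.

```cpp
#include <cstdint>

// Reads a 16-bit big-endian value at p and advances p past it.
// Note that p is a reference to the caller's pointer.
inline uint16_t GetU2BE(const uint8_t*& p) {
  const uint16_t v = static_cast<uint16_t>((p[0] << 8) | p[1]);
  p += 2;
  return v;
}

// Writes a 16-bit big-endian value at p and advances p past it.
inline void PutU2BE(uint8_t*& p, uint16_t v) {
  *p++ = static_cast<uint8_t>(v >> 8);
  *p++ = static_cast<uint8_t>(v & 0xff);
}
```

Because the pointer is advanced in place, consecutive calls read or write
consecutive fields without the caller doing any offset arithmetic.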

It's tempting to discard package-private classes and class members.
However, this would be incorrect because they are a necessary part
of the package interface, as a Java package is often compiled in
multiple stages. For example: in Bazel, both Java tests and Java
code inhabit the same Java package but are compiled separately.
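
In terms of class file access flags, this means a member is dropped only
when it is explicitly private. The sketch below uses the flag values from
the JVM class file format; the predicate name KeepMember is hypothetical,
and the real code may apply further rules that are not described here.

```cpp
#include <cstdint>

// Access flag values as defined by the JVM class file format.
constexpr uint16_t kAccPublic = 0x0001;
constexpr uint16_t kAccPrivate = 0x0002;
constexpr uint16_t kAccProtected = 0x0004;

// Hypothetical predicate: keep every field or method that is not private.
// Package-private members have none of the three visibility bits set, so
// they are kept along with public and protected ones.
bool KeepMember(uint16_t access_flags) {
  return (access_flags & kAccPrivate) == 0;
}
```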

Assumptions:

We assume that jar files are uncompressed v1.0 zip files (created
with 'jar c0f') with a zero general_purpose_bit_flag.
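
A check for this assumption might look like the sketch below, which
inspects a single zip local file header. It is an illustration, not the
tool's actual validation: the names are hypothetical, and reading "v1.0" as
a version-needed-to-extract value of 10 is an assumption of this sketch.

```cpp
#include <cstdint>

// Reads the 16-bit little-endian field at byte offset `off` of a header.
static uint16_t HeaderField16(const uint8_t* h, int off) {
  return static_cast<uint16_t>(h[off] | (h[off + 1] << 8));
}

// Hypothetical check: true if the local file header at `h` has the zip
// signature "PK\3\4", version needed 10 (taken here to mean v1.0), a zero
// general purpose bit flag, and compression method 0 (stored).
bool IsSupportedEntry(const uint8_t* h) {
  const bool has_signature =
      h[0] == 'P' && h[1] == 'K' && h[2] == 3 && h[3] == 4;
  const uint16_t version_needed = HeaderField16(h, 4);
  const uint16_t flags = HeaderField16(h, 6);
  const uint16_t method = HeaderField16(h, 8);
  return has_signature && version_needed == 10 && flags == 0 && method == 0;
}
```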

We assume that javap/javac don't need the correct CRC checksums in
the .jar file.

We assume that it's better simply to abort in the face of unknown
input than to risk leaving out something important from the output
(although in the case of annotations, it should be safe to ignore
ones we don't understand).

TODO:
Maybe: ensure a canonical sort order is used for every list (jar
entries, class members, attributes, etc.). This isn't essential
because we can assume the compiler is deterministic and the order in
the source files changes little. Also, it would require two passes. :(

Maybe: delete dynamically-allocated memory.

Add (a lot) more tests. Include a test of idempotency.