Cascading Change Log 0.4.0 Changed c.p.GroupBy default grouping fields to c.t.Fields.ALL from Fields.FIRST. This change provides a simple way to sort a tuple stream based on the order of the tuple fields. Changed c.f.FlowConnector to create c.f.Flow instances that will bypass the reducer if no c.p.Group is participating in the assembly. Previoiusly Group instances were inserted if missing. This allows a chain of c.p.Every instances to be used to process/filter a tuple stream without the invoking the reducer needlessly (if a sort isn't required). This change also supports bypassing the default Hadoop OutputCollector in the mapper via the sink c.t.Tap instance. Changed c.f.FlowStep behavior to run in 'local' mode if either the sink or source tap is a c.t.Lfs instance. This allows for c.f.Flow instances to run mixed if configured to execute on a particular cluster by default. This behavior supports complex import/export processes against the HDFS or other supported remote filesystem. Changed behavior of c.t.Dfs to force use of HDFS. Previously Dfs would default to the local FileSystem if the job was run in 'local'mode. Now a Dfs instance will cause failures if it cannot connect to a HDFS cluster. Using c.t.Hfs will provide previous Dfs behavior. Hfs will use the 'default' filesystem if a scheme is not present in the 'stringPath' (i.e. hdfs://host:port/some/path). Added c.stats package to allow for collecting statics of Cascades, Flows, and FlowSteps. Updated c.f.Flow and c.c.Cascade log messages to be easier to follow when executing many flow instances simultaneously. Added compression flag to c.s.TextLine. Can now toggle compression (Hadoop style compression) per Tap instance. This prevents clusters with compression enabled by default to export text files with a .deflate extension. Added support for bypassing Hadoop OutputCollector via Tap.setUseTapCollector() method. Setting to true will force Cascading to use the c.t.TapCollector instead. This bypasses bugs in Hadoop with custom FileSystem types. This will always be true for http(s) and s3tp filesystems when using a c.t.Hfs Tap type (atleast until HADOOP-3021 is resolved). Added c.t.TupleCollector, complementing c.t.TupleIterator, for directly writing Tuple instances out via a c.t.Tap instance. Added c.f.FlowListener so that c.f.Flow instances can fire events on starting, completed, and throwable. Changed c.t.h.S3HttpFileSystem so it can now create files remotely. Renamed cascading.spill.threshold to cascading.cogroup.spill.threshold, so there is less a chance of collision. Made numerous optimizations to improve overall performance. Namely split and merge of key/value tuples to remove redundancy in the stream between the mapper and reducer. Changed c.p.Operators to push c.o.Operation results directly through to next operation without intermediate collection. This should improve pipelining of large result streams and lower runtime memory footprint. Changed c.c.Cascade so it now runs Flows in parallel if Hadoop is clustered, and there are no dependencies between the Flows. Moved c.Cascade and related classed to c.cascade package. Wanted to preempt any future ugliness. Added support in c.t.h.S3HttpFileSystem for these properties: fs.s3tp.awsAccessKeyId and fs.s3tp.awsSecretAccessKey 0.3.0 Added ability to push Log4j logger properties to mapper/reducer via JobConf. Use jobConf.set("log4j.logger","logger1=LEVEL,logger2=LEVEL") Added missing equals() and hashCode() in c.t.MultiTap. Added c.t.h.ZipInputFormat (and ZipSplit) to support zip files. c.s.TextLine supports transparent reading of zip files if the filename ends with .zip, but cannot write to them. This code is loosely based on HADOOP-1824. If the underlying filesystem is hdfs or file, splits will be created for each ZipEntry. Otherwise ZipEntries are iterated over to be more stream friendly. Progress status is supported. Added http, https, and s3tp read-only file systems to Hadoop. Use these URLs, respectively: http://, https://, and s3tp://AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY@bucket-name/key Added c.o.t.DateFormatter supporting text formatting of time stamps created by c.o.t.DateParser. Fixed bug where in complex assemblies, some Scopes were not resolved. Fixed bug where tap instances were not being inserted before some CoGroup joins if there was a previous Group in the assembly. Upgraded JGraphT to 0.7.3 Changed c.t.SpillableTupleList allows for iteration across entries. Changed c.f.FlowException to optionally allow for printing of underlying pipe graph for debugging. Added c.o.t.FieldFormatter function to format Tuples into complex strings using j.u.Formatter formatting. Added c.o.a.Last aggregator to find the last value encountered in a group. Changed c.o.a.Max and c.o.a.Min to maintain original value type. Will return null if no values are encountered. Changed c.o.a.First to use Fields.ARG by default. Removed Fields constructor. Added c.t.Fields.join(Fields...) method to allow for joining multiple Fields instances into a new instance. Can retrieve Tuple values by field name through the TupleEntry class via the get(String) method. Added c.t.TupleCollector interface to simplify the operation interfaces. Added a Debug filter that will print to either stderr or stdout. Useful for debugging stream transformations. Added CascadingTestCase base test class Added Insert Function that allows for literal values to be inserted into the Tuple stream. 0.2.0 CoGroup will now spill to disk on extremely large co-groupings. Configurable via "cascading.spill.threshold". Defaults to 10k elements. java.util.Properties instances can be used to set defauls for FlowConnectors. Fix for InnerJoin, the default join for CoGroup. Introduced MultiTap to support concatenation of files into a pipe assembly. RegexParser now fails on a failed match. Prevents it being used or behaving as a filter. Fixed bug with PipeAssembly instances not properly being assimiliated into the pipeGraph. Fixed assertion error thrown by JGraphT. Renamed Tap method deleteOnInit to deleteOnSinkInit. 0.1.0 First release.