BaseDataPublisher (WMF specific extensions to Gobblin)

java.lang.Object
- org.apache.gobblin.publisher.DataPublisher
  - org.apache.gobblin.publisher.SingleTaskDataPublisher
    - org.wikimedia.gobblin.copy.BaseDataPublisher

All Implemented Interfaces:

Closeable, AutoCloseable, org.apache.gobblin.capability.CapabilityAware

Direct Known Subclasses:

TimePartitionedDataPublisher
```
public class BaseDataPublisher
extends org.apache.gobblin.publisher.SingleTaskDataPublisher
```
A basic implementation of SingleTaskDataPublisher that publishes the data from the writer output directory to the final output directory.
The final output directory is specified by ConfigurationKeys.DATA_PUBLISHER_FINAL_DIR, and the output of each writer is published under it. Each writer can also specify a path with the config key ConfigurationKeys.WRITER_FILE_PATH, in which case its final output location is ConfigurationKeys.DATA_PUBLISHER_FINAL_DIR/ConfigurationKeys.WRITER_FILE_PATH. If ConfigurationKeys.WRITER_FILE_PATH is not specified, a default path is assigned; it is constructed in the Extract.getOutputFilePath() method.
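
The path composition described above can be illustrated with a minimal sketch. It is a stand-in only: it uses java.nio.file.Path rather than org.apache.hadoop.fs.Path, does not call any Gobblin code, and the class name, method name, and directory values are hypothetical.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Stand-in sketch, not Gobblin code: shows how a final output location
// could be composed from the two configured values.
class FinalOutputPathSketch {

    // publisherFinalDir plays the role of DATA_PUBLISHER_FINAL_DIR,
    // writerFilePath the role of WRITER_FILE_PATH (assumed to be relative).
    static Path finalOutputPath(String publisherFinalDir, String writerFilePath) {
        return Paths.get(publisherFinalDir).resolve(writerFilePath).normalize();
    }

    public static void main(String[] args) {
        // Hypothetical values chosen for illustration only.
        System.out.println(finalOutputPath("/data/final", "events/page_view"));
    }
}
```

The sketch assumes a relative writer path; an absolute one would replace the base directory under java.nio.file.Path.resolve semantics, which may differ from how the real publisher behaves.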

This publisher records every directory it publishes to in the property ConfigurationKeys.PUBLISHER_DIRS. Each time it publishes a Path, it records the path itself if it is a directory, or the parent directory if it is a file. To change this behavior, override recordPublisherOutputDirs(Path, Path, int).
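
The recording rule above (directory: record it; file: record its parent) can be sketched as follows. This is an illustration under stated assumptions rather than the actual recordPublisherOutputDirs implementation: it uses java.nio.file.Path instead of the Hadoop Path type, takes the "is a directory" answer as a parameter instead of querying a file system, and the class and method names are hypothetical.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Stand-in sketch, not Gobblin code: picks which directory to record
// for a published path.
class RecordDirsSketch {

    // Returns the published path itself when it is a directory,
    // otherwise the directory that contains the published file.
    static Path dirToRecord(Path published, boolean isDirectory) {
        return isDirectory ? published : published.getParent();
    }

    public static void main(String[] args) {
        // Hypothetical paths; both calls resolve to the same directory.
        System.out.println(dirToRecord(Paths.get("/data/final/events"), true));
        System.out.println(dirToRecord(Paths.get("/data/final/events/part-0.avro"), false));
    }
}
```

A subclass that wants different bookkeeping would replace this decision in its own override of recordPublisherOutputDirs.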

This is an updated copy of BaseDataPublisher from the gobblin-core module. This file should be deleted in favor of the upstream version when possible. The updates are:
- Extracted the function addWriterOutputToNewDir(org.apache.hadoop.fs.Path, org.apache.hadoop.fs.Path, org.apache.gobblin.configuration.WorkUnitState, int, org.apache.gobblin.util.ParallelRunner) from publishData(org.apache.gobblin.configuration.WorkUnitState) so that TimePartitionedDataPublisher can override its behavior -- https://github.com/apache/gobblin/pull/3409

Field Summary

Fields
Modifier and Type	Field and Description
`protected com.google.common.io.Closer`	`closer`
`protected com.google.common.base.Optional<org.apache.gobblin.metrics.event.lineage.LineageInfo>`	`lineageInfo`
`protected Map<org.apache.gobblin.writer.PartitionIdentifier,org.apache.gobblin.metadata.MetadataMerger<String>>`	`metadataMergers`
`protected List<org.apache.hadoop.fs.FileSystem>`	`metaDataWriterFileSystemByBranches`
`protected int`	`numBranches`
`protected com.google.common.io.Closer`	`parallelRunnerCloser`
`protected Map<String,org.apache.gobblin.util.ParallelRunner>`	`parallelRunners`
`protected int`	`parallelRunnerThreads`
`protected List<org.apache.hadoop.fs.permission.FsPermission>`	`permissions`
`protected List<org.apache.hadoop.fs.FileSystem>`	`publisherFileSystemByBranches`
`protected List<com.google.common.base.Optional<String>>`	`publisherFinalDirOwnerGroupsByBranches`
`protected Set<org.apache.hadoop.fs.Path>`	`publisherOutputDirs`
`protected com.typesafe.config.Config`	`retrierConfig`
`protected boolean`	`shouldRetry`
`protected List<org.apache.hadoop.fs.FileSystem>`	`writerFileSystemByBranches`

Fields inherited from class org.apache.gobblin.publisher.DataPublisher
REUSABLE, state

Constructor Summary

Constructors
Constructor and Description
`BaseDataPublisher(org.apache.gobblin.configuration.State state)`

Method Summary

Modifier and Type	Method and Description
`protected void`	`addSingleTaskWriterOutputToExistingDir(org.apache.hadoop.fs.Path writerOutputDir, org.apache.hadoop.fs.Path publisherOutputDir, org.apache.gobblin.configuration.WorkUnitState workUnitState, int branchId, org.apache.gobblin.util.ParallelRunner parallelRunner)`
`protected void`	`addWriterOutputToExistingDir(org.apache.hadoop.fs.Path writerOutputDir, org.apache.hadoop.fs.Path publisherOutputDir, org.apache.gobblin.configuration.WorkUnitState workUnitState, int branchId, org.apache.gobblin.util.ParallelRunner parallelRunner)`
`protected void`	`addWriterOutputToNewDir(org.apache.hadoop.fs.Path writerOutput, org.apache.hadoop.fs.Path publisherOutput, org.apache.gobblin.configuration.WorkUnitState workUnitState, int branchId, org.apache.gobblin.util.ParallelRunner parallelRunner)`
`void`	`close()`
`protected org.apache.gobblin.dataset.DatasetDescriptor`	`createDestinationDescriptor(org.apache.gobblin.configuration.WorkUnitState state, int branchId)` Create destination dataset descriptor
`protected org.apache.hadoop.fs.Path`	`getPublisherOutputDir(org.apache.gobblin.configuration.WorkUnitState workUnitState, int branchId)` Get the output directory path this `BaseDataPublisher` will write to.
`void`	`initialize()`
`protected void`	`movePath(org.apache.gobblin.util.ParallelRunner parallelRunner, org.apache.gobblin.configuration.State state, org.apache.hadoop.fs.Path src, org.apache.hadoop.fs.Path dst, int branchId)`
`void`	`publishData(Collection<? extends org.apache.gobblin.configuration.WorkUnitState> states)`
`void`	`publishData(org.apache.gobblin.configuration.WorkUnitState state)`
`protected void`	`publishData(org.apache.gobblin.configuration.WorkUnitState state, int branchId, boolean publishSingleTaskData, Set<org.apache.hadoop.fs.Path> writerOutputPathsMoved)`
`void`	`publishMetadata(Collection<? extends org.apache.gobblin.configuration.WorkUnitState> states)` Merge all of the metadata output from each work-unit and publish the merged record.
`void`	`publishMetadata(org.apache.gobblin.configuration.WorkUnitState state)` Publish metadata for each branch.
`protected void`	`publishMultiTaskData(org.apache.gobblin.configuration.WorkUnitState state, int branchId, Set<org.apache.hadoop.fs.Path> writerOutputPathsMoved)` This method publishes task output data for the given `WorkUnitState`, but if there are output data of other tasks in the same folder, it may also publish those data.
`protected Collection<org.apache.hadoop.fs.Path>`	`recordPublisherOutputDirs(org.apache.hadoop.fs.Path src, org.apache.hadoop.fs.Path dst, int branchId)`
`protected boolean`	`shouldPublishMetadataFirst()` BaseDataPublisher relies on publishData() to create and clean up the output directories, so data must be published before metadata.

Methods inherited from class org.apache.gobblin.publisher.SingleTaskDataPublisher
getInstance, publish

Methods inherited from class org.apache.gobblin.publisher.DataPublisher
canBeSkipped, getState, isThreadSafe, markCommit, publish, supportsCapability

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail

numBranches
```
protected final int numBranches
```

writerFileSystemByBranches

protected final List<org.apache.hadoop.fs.FileSystem> writerFileSystemByBranches

publisherFileSystemByBranches

protected final List<org.apache.hadoop.fs.FileSystem> publisherFileSystemByBranches

metaDataWriterFileSystemByBranches

protected final List<org.apache.hadoop.fs.FileSystem> metaDataWriterFileSystemByBranches

publisherFinalDirOwnerGroupsByBranches

protected final List<com.google.common.base.Optional<String>> publisherFinalDirOwnerGroupsByBranches

permissions

protected final List<org.apache.hadoop.fs.permission.FsPermission> permissions

closer

protected final com.google.common.io.Closer closer

parallelRunnerCloser

protected final com.google.common.io.Closer parallelRunnerCloser

parallelRunnerThreads

protected final int parallelRunnerThreads

parallelRunners

protected final Map<String,org.apache.gobblin.util.ParallelRunner> parallelRunners

publisherOutputDirs

protected final Set<org.apache.hadoop.fs.Path> publisherOutputDirs

lineageInfo

protected final com.google.common.base.Optional<org.apache.gobblin.metrics.event.lineage.LineageInfo> lineageInfo

metadataMergers

protected final Map<org.apache.gobblin.writer.PartitionIdentifier,org.apache.gobblin.metadata.MetadataMerger<String>> metadataMergers

shouldRetry
```
protected final boolean shouldRetry
```

retrierConfig

protected final com.typesafe.config.Config retrierConfig

Constructor Detail

BaseDataPublisher

public BaseDataPublisher(org.apache.gobblin.configuration.State state)
                  throws IOException

Throws: IOException

Method Detail

initialize
```
public void initialize()
                throws IOException
```
Specified by:

initialize in class org.apache.gobblin.publisher.DataPublisher

Throws:

IOException

close

public void close()
           throws IOException

Throws: IOException

createDestinationDescriptor

protected org.apache.gobblin.dataset.DatasetDescriptor createDestinationDescriptor(org.apache.gobblin.configuration.WorkUnitState state,
                                                                                   int branchId)

Create destination dataset descriptor

publishData

public void publishData(org.apache.gobblin.configuration.WorkUnitState state)
                 throws IOException

Specified by: publishData in class org.apache.gobblin.publisher.SingleTaskDataPublisher
Throws: IOException

publishData

public void publishData(Collection<? extends org.apache.gobblin.configuration.WorkUnitState> states)
                 throws IOException

Specified by: publishData in class org.apache.gobblin.publisher.DataPublisher
Throws: IOException

publishMultiTaskData

protected void publishMultiTaskData(org.apache.gobblin.configuration.WorkUnitState state,
                                    int branchId,
                                    Set<org.apache.hadoop.fs.Path> writerOutputPathsMoved)
                             throws IOException

This method publishes task output data for the given WorkUnitState, but if there are output data of other tasks in the same folder, it may also publish those data.

Throws: IOException

publishData

protected void publishData(org.apache.gobblin.configuration.WorkUnitState state,
                           int branchId,
                           boolean publishSingleTaskData,
                           Set<org.apache.hadoop.fs.Path> writerOutputPathsMoved)
                    throws IOException

Throws: IOException

getPublisherOutputDir
```
protected org.apache.hadoop.fs.Path getPublisherOutputDir(org.apache.gobblin.configuration.WorkUnitState workUnitState,
                                                          int branchId)
```
Get the output directory path this BaseDataPublisher will write to.
This is the default implementation. Subclasses of BaseDataPublisher may override this to write to a custom directory or write using a custom directory structure or naming pattern.

Parameters:

workUnitState - a WorkUnitState object

branchId - the fork branch ID

Returns:

the output directory path this BaseDataPublisher will write to

addSingleTaskWriterOutputToExistingDir

protected void addSingleTaskWriterOutputToExistingDir(org.apache.hadoop.fs.Path writerOutputDir,
                                                      org.apache.hadoop.fs.Path publisherOutputDir,
                                                      org.apache.gobblin.configuration.WorkUnitState workUnitState,
                                                      int branchId,
                                                      org.apache.gobblin.util.ParallelRunner parallelRunner)
                                               throws IOException

Throws: IOException

addWriterOutputToNewDir

protected void addWriterOutputToNewDir(org.apache.hadoop.fs.Path writerOutput,
                                       org.apache.hadoop.fs.Path publisherOutput,
                                       org.apache.gobblin.configuration.WorkUnitState workUnitState,
                                       int branchId,
                                       org.apache.gobblin.util.ParallelRunner parallelRunner)
                                throws IOException

Throws: IOException

addWriterOutputToExistingDir

protected void addWriterOutputToExistingDir(org.apache.hadoop.fs.Path writerOutputDir,
                                            org.apache.hadoop.fs.Path publisherOutputDir,
                                            org.apache.gobblin.configuration.WorkUnitState workUnitState,
                                            int branchId,
                                            org.apache.gobblin.util.ParallelRunner parallelRunner)
                                     throws IOException

Throws: IOException

movePath

protected void movePath(org.apache.gobblin.util.ParallelRunner parallelRunner,
                        org.apache.gobblin.configuration.State state,
                        org.apache.hadoop.fs.Path src,
                        org.apache.hadoop.fs.Path dst,
                        int branchId)
                 throws IOException

Throws: IOException

recordPublisherOutputDirs

protected Collection<org.apache.hadoop.fs.Path> recordPublisherOutputDirs(org.apache.hadoop.fs.Path src,
                                                                          org.apache.hadoop.fs.Path dst,
                                                                          int branchId)
                                                                   throws IOException

Throws: IOException

publishMetadata
```
public void publishMetadata(Collection<? extends org.apache.gobblin.configuration.WorkUnitState> states)
                     throws IOException
```
Merge all of the metadata output from each work-unit and publish the merged record.

Specified by:

publishMetadata in class org.apache.gobblin.publisher.DataPublisher

Parameters:

states - States from all tasks

Throws:

IOException - If there is an error publishing the file

publishMetadata
```
public void publishMetadata(org.apache.gobblin.configuration.WorkUnitState state)
                     throws IOException
```
Publish metadata for each branch. We expect the metadata to be of String format and populated in either the WRITER_MERGED_METADATA_KEY state or the WRITER_METADATA_KEY configuration key.

Specified by:

publishMetadata in class org.apache.gobblin.publisher.SingleTaskDataPublisher

Throws:

IOException

shouldPublishMetadataFirst
```
protected boolean shouldPublishMetadataFirst()
```
BaseDataPublisher relies on publishData() to create and clean up the output directories, so data must be published before metadata.

Overrides:

shouldPublishMetadataFirst in class org.apache.gobblin.publisher.DataPublisher
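
The ordering constraint can be illustrated with a small stand-in. The sketch below assumes that a driver method consults shouldPublishMetadataFirst() to choose the order of the two publish steps, and that a data-first publisher answers false; both are assumptions made for illustration, and none of the classes here are the Gobblin ones.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in sketch, not Gobblin code: shows how a boolean hook can decide
// whether data or metadata is published first.
class PublishOrderSketch {

    abstract static class Publisher {
        final List<String> calls = new ArrayList<>();

        abstract boolean shouldPublishMetadataFirst();

        void publishData() { calls.add("data"); }          // would create the output dirs
        void publishMetadata() { calls.add("metadata"); }  // would write into those dirs

        // Assumed driver: picks the order from the hook above.
        void publish() {
            if (shouldPublishMetadataFirst()) {
                publishMetadata();
                publishData();
            } else {
                publishData();
                publishMetadata();
            }
        }
    }

    // Hypothetical publisher that needs its directories created by publishData().
    static class DataFirstPublisher extends Publisher {
        @Override
        boolean shouldPublishMetadataFirst() { return false; }
    }

    public static void main(String[] args) {
        Publisher publisher = new DataFirstPublisher();
        publisher.publish();
        System.out.println(publisher.calls);
    }
}
```

Keeping the decision in one overridable hook lets a subclass change the order without reimplementing the publish steps themselves.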
