public class BaseDataPublisher
extends org.apache.gobblin.publisher.SingleTaskDataPublisher
SingleTaskDataPublisher
that publishes the data from the writer output directory
to the final output directory.
The final output directory is specified by ConfigurationKeys.DATA_PUBLISHER_FINAL_DIR
. The output of each
writer is written to this directory. Each individual writer can also specify a path in the config key
ConfigurationKeys.WRITER_FILE_PATH
. Then the final output data for a writer will be
ConfigurationKeys.DATA_PUBLISHER_FINAL_DIR
/ConfigurationKeys.WRITER_FILE_PATH
. If the
ConfigurationKeys.WRITER_FILE_PATH
is not specified, a default one is assigned. The default path is
constructed in the Extract.getOutputFilePath()
method.
This publisher records all dirs it publishes to in property ConfigurationKeys.PUBLISHER_DIRS
. Each time it
publishes a Path
, if the path is a directory, it records this path. If the path is a file, it records the
parent directory of the path. To change this behavior one may override
recordPublisherOutputDirs(Path, Path, int)
.
This is an updated copy of BaseDataPublisher
in gobblin-core module. This file should be deleted in favor of the upstream version when possible.
Updates are:
addWriterOutputToNewDir(org.apache.hadoop.fs.Path, org.apache.hadoop.fs.Path, org.apache.gobblin.configuration.WorkUnitState, int, org.apache.gobblin.util.ParallelRunner)
from publishData(org.apache.gobblin.configuration.WorkUnitState)
allowing to override the behavior
in TimePartitionedDataPublisher -- https://github.com/apache/gobblin/pull/3409Modifier and Type | Field and Description |
---|---|
protected com.google.common.io.Closer |
closer |
protected com.google.common.base.Optional<org.apache.gobblin.metrics.event.lineage.LineageInfo> |
lineageInfo |
protected Map<org.apache.gobblin.writer.PartitionIdentifier,org.apache.gobblin.metadata.MetadataMerger<String>> |
metadataMergers |
protected List<org.apache.hadoop.fs.FileSystem> |
metaDataWriterFileSystemByBranches |
protected int |
numBranches |
protected com.google.common.io.Closer |
parallelRunnerCloser |
protected Map<String,org.apache.gobblin.util.ParallelRunner> |
parallelRunners |
protected int |
parallelRunnerThreads |
protected List<org.apache.hadoop.fs.permission.FsPermission> |
permissions |
protected List<org.apache.hadoop.fs.FileSystem> |
publisherFileSystemByBranches |
protected List<com.google.common.base.Optional<String>> |
publisherFinalDirOwnerGroupsByBranches |
protected Set<org.apache.hadoop.fs.Path> |
publisherOutputDirs |
protected com.typesafe.config.Config |
retrierConfig |
protected boolean |
shouldRetry |
protected List<org.apache.hadoop.fs.FileSystem> |
writerFileSystemByBranches |
Constructor and Description |
---|
BaseDataPublisher(org.apache.gobblin.configuration.State state) |
Modifier and Type | Method and Description |
---|---|
protected void |
addSingleTaskWriterOutputToExistingDir(org.apache.hadoop.fs.Path writerOutputDir,
org.apache.hadoop.fs.Path publisherOutputDir,
org.apache.gobblin.configuration.WorkUnitState workUnitState,
int branchId,
org.apache.gobblin.util.ParallelRunner parallelRunner) |
protected void |
addWriterOutputToExistingDir(org.apache.hadoop.fs.Path writerOutputDir,
org.apache.hadoop.fs.Path publisherOutputDir,
org.apache.gobblin.configuration.WorkUnitState workUnitState,
int branchId,
org.apache.gobblin.util.ParallelRunner parallelRunner) |
protected void |
addWriterOutputToNewDir(org.apache.hadoop.fs.Path writerOutput,
org.apache.hadoop.fs.Path publisherOutput,
org.apache.gobblin.configuration.WorkUnitState workUnitState,
int branchId,
org.apache.gobblin.util.ParallelRunner parallelRunner) |
void |
close() |
protected org.apache.gobblin.dataset.DatasetDescriptor |
createDestinationDescriptor(org.apache.gobblin.configuration.WorkUnitState state,
int branchId)
Create destination dataset descriptor
|
protected org.apache.hadoop.fs.Path |
getPublisherOutputDir(org.apache.gobblin.configuration.WorkUnitState workUnitState,
int branchId)
Get the output directory path this
BaseDataPublisher will write to. |
void |
initialize() |
protected void |
movePath(org.apache.gobblin.util.ParallelRunner parallelRunner,
org.apache.gobblin.configuration.State state,
org.apache.hadoop.fs.Path src,
org.apache.hadoop.fs.Path dst,
int branchId) |
void |
publishData(Collection<? extends org.apache.gobblin.configuration.WorkUnitState> states) |
void |
publishData(org.apache.gobblin.configuration.WorkUnitState state) |
protected void |
publishData(org.apache.gobblin.configuration.WorkUnitState state,
int branchId,
boolean publishSingleTaskData,
Set<org.apache.hadoop.fs.Path> writerOutputPathsMoved) |
void |
publishMetadata(Collection<? extends org.apache.gobblin.configuration.WorkUnitState> states)
Merge all of the metadata output from each work-unit and publish the merged record.
|
void |
publishMetadata(org.apache.gobblin.configuration.WorkUnitState state)
Publish metadata for each branch.
|
protected void |
publishMultiTaskData(org.apache.gobblin.configuration.WorkUnitState state,
int branchId,
Set<org.apache.hadoop.fs.Path> writerOutputPathsMoved)
This method publishes task output data for the given
WorkUnitState , but if there are output data of
other tasks in the same folder, it may also publish those data. |
protected Collection<org.apache.hadoop.fs.Path> |
recordPublisherOutputDirs(org.apache.hadoop.fs.Path src,
org.apache.hadoop.fs.Path dst,
int branchId) |
protected boolean |
shouldPublishMetadataFirst()
The BaseDataPublisher relies on publishData() to create and clean-up the output directories, so data
has to be published before the metadata can be.
|
getInstance, publish
protected final int numBranches
protected final List<org.apache.hadoop.fs.FileSystem> writerFileSystemByBranches
protected final List<org.apache.hadoop.fs.FileSystem> publisherFileSystemByBranches
protected final List<org.apache.hadoop.fs.FileSystem> metaDataWriterFileSystemByBranches
protected final List<com.google.common.base.Optional<String>> publisherFinalDirOwnerGroupsByBranches
protected final List<org.apache.hadoop.fs.permission.FsPermission> permissions
protected final com.google.common.io.Closer closer
protected final com.google.common.io.Closer parallelRunnerCloser
protected final int parallelRunnerThreads
protected final Set<org.apache.hadoop.fs.Path> publisherOutputDirs
protected final com.google.common.base.Optional<org.apache.gobblin.metrics.event.lineage.LineageInfo> lineageInfo
protected final Map<org.apache.gobblin.writer.PartitionIdentifier,org.apache.gobblin.metadata.MetadataMerger<String>> metadataMergers
protected final boolean shouldRetry
protected final com.typesafe.config.Config retrierConfig
public BaseDataPublisher(org.apache.gobblin.configuration.State state) throws IOException
IOException
public void initialize() throws IOException
initialize
in class org.apache.gobblin.publisher.DataPublisher
IOException
public void close() throws IOException
IOException
protected org.apache.gobblin.dataset.DatasetDescriptor createDestinationDescriptor(org.apache.gobblin.configuration.WorkUnitState state, int branchId)
public void publishData(org.apache.gobblin.configuration.WorkUnitState state) throws IOException
publishData
in class org.apache.gobblin.publisher.SingleTaskDataPublisher
IOException
public void publishData(Collection<? extends org.apache.gobblin.configuration.WorkUnitState> states) throws IOException
publishData
in class org.apache.gobblin.publisher.DataPublisher
IOException
protected void publishMultiTaskData(org.apache.gobblin.configuration.WorkUnitState state, int branchId, Set<org.apache.hadoop.fs.Path> writerOutputPathsMoved) throws IOException
WorkUnitState
, but if there are output data of
other tasks in the same folder, it may also publish those data.IOException
protected void publishData(org.apache.gobblin.configuration.WorkUnitState state, int branchId, boolean publishSingleTaskData, Set<org.apache.hadoop.fs.Path> writerOutputPathsMoved) throws IOException
IOException
protected org.apache.hadoop.fs.Path getPublisherOutputDir(org.apache.gobblin.configuration.WorkUnitState workUnitState, int branchId)
BaseDataPublisher
will write to.
This is the default implementation. Subclasses of BaseDataPublisher
may override this
to write to a custom directory or write using a custom directory structure or naming pattern.
workUnitState
- a WorkUnitState
objectbranchId
- the fork branch IDBaseDataPublisher
will write toprotected void addSingleTaskWriterOutputToExistingDir(org.apache.hadoop.fs.Path writerOutputDir, org.apache.hadoop.fs.Path publisherOutputDir, org.apache.gobblin.configuration.WorkUnitState workUnitState, int branchId, org.apache.gobblin.util.ParallelRunner parallelRunner) throws IOException
IOException
protected void addWriterOutputToNewDir(org.apache.hadoop.fs.Path writerOutput, org.apache.hadoop.fs.Path publisherOutput, org.apache.gobblin.configuration.WorkUnitState workUnitState, int branchId, org.apache.gobblin.util.ParallelRunner parallelRunner) throws IOException
IOException
protected void addWriterOutputToExistingDir(org.apache.hadoop.fs.Path writerOutputDir, org.apache.hadoop.fs.Path publisherOutputDir, org.apache.gobblin.configuration.WorkUnitState workUnitState, int branchId, org.apache.gobblin.util.ParallelRunner parallelRunner) throws IOException
IOException
protected void movePath(org.apache.gobblin.util.ParallelRunner parallelRunner, org.apache.gobblin.configuration.State state, org.apache.hadoop.fs.Path src, org.apache.hadoop.fs.Path dst, int branchId) throws IOException
IOException
protected Collection<org.apache.hadoop.fs.Path> recordPublisherOutputDirs(org.apache.hadoop.fs.Path src, org.apache.hadoop.fs.Path dst, int branchId) throws IOException
IOException
public void publishMetadata(Collection<? extends org.apache.gobblin.configuration.WorkUnitState> states) throws IOException
publishMetadata
in class org.apache.gobblin.publisher.DataPublisher
states
- States from all tasksIOException
- If there is an error publishing the filepublic void publishMetadata(org.apache.gobblin.configuration.WorkUnitState state) throws IOException
publishMetadata
in class org.apache.gobblin.publisher.SingleTaskDataPublisher
IOException
protected boolean shouldPublishMetadataFirst()
shouldPublishMetadataFirst
in class org.apache.gobblin.publisher.DataPublisher
Copyright © 2021. All rights reserved.