Package org.apache.cassandra.spark.data
Class DataLayer
- java.lang.Object
-
- org.apache.cassandra.spark.data.DataLayer
-
- All Implemented Interfaces:
java.io.Serializable
- Direct Known Subclasses:
LocalDataLayer, PartitionedDataLayer
public abstract class DataLayer extends java.lang.Object implements java.io.Serializable
- See Also:
- Serialized Form
-
-
Field Summary
Fields
static long serialVersionUID
-
Constructor Summary
Constructors
DataLayer()
-
Method Summary
All Methods · Instance Methods · Abstract Methods · Concrete Methods

org.apache.cassandra.bridge.BigNumberConfig bigNumberConfig(org.apache.cassandra.spark.data.CqlField field)
    DataLayer can override this method to return the BigInteger/BigDecimal precision/scale values for a given column.
abstract org.apache.cassandra.bridge.CassandraBridge bridge()
abstract org.apache.cassandra.spark.data.CqlTable cqlTable()
protected abstract java.util.concurrent.ExecutorService executorService()
    DataLayer implementation should provide an ExecutorService for doing blocking I/O when opening SSTable readers.
abstract boolean isInPartition(int partitionId, java.math.BigInteger token, java.nio.ByteBuffer key)
abstract java.lang.String jobId()
org.apache.cassandra.spark.reader.StreamScanner openCompactionScanner(int partitionId, java.util.List<org.apache.cassandra.spark.sparksql.filters.PartitionKeyFilter> partitionKeyFilters, org.apache.cassandra.spark.sparksql.filters.SSTableTimeRangeFilter sstableTimeRangeFilter)
org.apache.cassandra.spark.reader.StreamScanner<org.apache.cassandra.spark.reader.RowData> openCompactionScanner(int partitionId, java.util.List<org.apache.cassandra.spark.sparksql.filters.PartitionKeyFilter> partitionKeyFilters, org.apache.cassandra.spark.sparksql.filters.SSTableTimeRangeFilter sstableTimeRangeFilter, org.apache.cassandra.spark.sparksql.filters.PruneColumnFilter columnFilter)
org.apache.cassandra.spark.reader.StreamScanner<org.apache.cassandra.spark.reader.IndexEntry> openPartitionSizeIterator(int partitionId)
abstract int partitionCount()
abstract org.apache.cassandra.spark.data.partitioner.Partitioner partitioner()
java.util.List<org.apache.cassandra.spark.sparksql.filters.PartitionKeyFilter> partitionKeyFiltersInRange(int partitionId, java.util.List<org.apache.cassandra.spark.sparksql.filters.PartitionKeyFilter> partitionKeyFilters)
org.apache.spark.sql.types.StructType partitionSizeStructType()
boolean readIndexOffset()
    When true, the SSTableReader should attempt to find the offset into the Data.db file for the Spark worker's token range.
java.util.List<org.apache.cassandra.spark.config.SchemaFeature> requestedFeatures()
org.apache.cassandra.spark.sparksql.filters.SparkRangeFilter sparkRangeFilter(int partitionId)
    DataLayer implementation should provide a SparkRangeFilter to filter out partitions and mutations that do not overlap with the Spark worker's token range.
abstract org.apache.cassandra.spark.data.SSTablesSupplier sstables(int partitionId, org.apache.cassandra.spark.sparksql.filters.SparkRangeFilter sparkRangeFilter, java.util.List<org.apache.cassandra.spark.sparksql.filters.PartitionKeyFilter> partitionKeyFilters)
org.apache.cassandra.spark.sparksql.filters.SSTableTimeRangeFilter sstableTimeRangeFilter()
    Returns SSTableTimeRangeFilter to filter out SSTables based on min and max timestamp.
org.apache.cassandra.analytics.stats.Stats stats()
    Override to plug in your own Stats instrumentation for recording internal events.
org.apache.spark.sql.types.StructType structType()
    Map Cassandra CQL table schema to SparkSQL StructType.
abstract org.apache.cassandra.spark.utils.TimeProvider timeProvider()
org.apache.cassandra.spark.data.converter.SparkSqlTypeConverter typeConverter()
org.apache.spark.sql.sources.Filter[] unsupportedPushDownFilters(org.apache.spark.sql.sources.Filter[] filters)
boolean useIncrementalRepair()
    When true, the SSTableReader should only read repaired SSTables from a single 'primary repair' replica and read unrepaired SSTables at the user-set consistency level.
org.apache.cassandra.bridge.CassandraVersion version()
-
-
-
Field Detail
-
serialVersionUID
public static final long serialVersionUID
- See Also:
- Constant Field Values
-
-
Method Detail
-
partitionSizeStructType
public org.apache.spark.sql.types.StructType partitionSizeStructType()
- Returns:
- SparkSQL table schema expected for reading Partition sizes with PartitionSizeTableProvider.
-
structType
public org.apache.spark.sql.types.StructType structType()
Map Cassandra CQL table schema to SparkSQL StructType.
- Returns:
- StructType representation of CQL table
-
requestedFeatures
public java.util.List<org.apache.cassandra.spark.config.SchemaFeature> requestedFeatures()
-
bigNumberConfig
public org.apache.cassandra.bridge.BigNumberConfig bigNumberConfig(org.apache.cassandra.spark.data.CqlField field)
DataLayer can override this method to return the BigInteger/BigDecimal precision/scale values for a given column.
- Parameters:
- field - the CQL field
- Returns:
- a BigNumberConfig object that specifies the desired precision/scale for BigDecimal and BigInteger
-
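What precision/scale mean for a decimal column can be shown without the real `BigNumberConfig` or `CqlField` types, which are not reproduced here. The sketch below is a self-contained illustration, with hypothetical fixed settings standing in for what `bigNumberConfig(field)` would return; a reader honoring such a config would normalize each BigDecimal like this:

```java
import java.math.BigDecimal;
import java.math.MathContext;
import java.math.RoundingMode;

// Illustration only: PRECISION and SCALE stand in for the per-column values
// a real BigNumberConfig would supply for a BigDecimal column.
public class BigNumberExample {
    static final int PRECISION = 10; // total significant digits (assumed value)
    static final int SCALE = 2;      // digits after the decimal point (assumed value)

    static BigDecimal normalize(BigDecimal value) {
        // Round to the configured precision, then pin to the configured scale
        return value.round(new MathContext(PRECISION))
                    .setScale(SCALE, RoundingMode.HALF_UP);
    }

    public static void main(String[] args) {
        System.out.println(normalize(new BigDecimal("12345.6789"))); // 12345.68
    }
}
```

SparkSQL's DecimalType requires a fixed precision and scale up front, which is why the reader needs these values per column rather than inferring them row by row.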
version
public org.apache.cassandra.bridge.CassandraVersion version()
- Returns:
- Cassandra version (3.0, 4.0, etc.)
-
bridge
public abstract org.apache.cassandra.bridge.CassandraBridge bridge()
- Returns:
- version-specific CassandraBridge wrapping shaded packages
-
typeConverter
public org.apache.cassandra.spark.data.converter.SparkSqlTypeConverter typeConverter()
- Returns:
- SparkSQL type converter that maps version-specific Cassandra types to SparkSQL types
-
partitionCount
public abstract int partitionCount()
-
cqlTable
public abstract org.apache.cassandra.spark.data.CqlTable cqlTable()
- Returns:
- CqlTable object for the table being read (batch/bulk read jobs only)
-
isInPartition
public abstract boolean isInPartition(int partitionId, java.math.BigInteger token, java.nio.ByteBuffer key)
-
timeProvider
public abstract org.apache.cassandra.spark.utils.TimeProvider timeProvider()
- Returns:
- a TimeProvider
-
partitionKeyFiltersInRange
public java.util.List<org.apache.cassandra.spark.sparksql.filters.PartitionKeyFilter> partitionKeyFiltersInRange(int partitionId, java.util.List<org.apache.cassandra.spark.sparksql.filters.PartitionKeyFilter> partitionKeyFilters) throws org.apache.cassandra.spark.sparksql.NoMatchFoundException
- Throws:
- org.apache.cassandra.spark.sparksql.NoMatchFoundException
-
sparkRangeFilter
public org.apache.cassandra.spark.sparksql.filters.SparkRangeFilter sparkRangeFilter(int partitionId)
DataLayer implementation should provide a SparkRangeFilter to filter out partitions and mutations that do not overlap with the Spark worker's token range.
- Parameters:
- partitionId - the partitionId for the task
- Returns:
- SparkRangeFilter for the Spark worker's token range
-
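The idea behind a range filter can be sketched without the real SparkRangeFilter API: each Spark task owns a token sub-range, and any partition whose token falls outside it is dropped. The class below is a minimal, self-contained stand-in (the class name, inclusive bounds, and example tokens are all assumptions, not the actual filter implementation):

```java
import java.math.BigInteger;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of a per-task token-range filter; Cassandra tokens are
// BigIntegers, and here the bounds are treated as inclusive for simplicity.
public class TokenRangeFilterSketch {
    final BigInteger start;
    final BigInteger end;

    TokenRangeFilterSketch(long start, long end) {
        this.start = BigInteger.valueOf(start);
        this.end = BigInteger.valueOf(end);
    }

    boolean overlaps(BigInteger token) {
        return token.compareTo(start) >= 0 && token.compareTo(end) <= 0;
    }

    // Keep only the tokens that fall inside this worker's sub-range
    List<BigInteger> filter(List<BigInteger> tokens) {
        return tokens.stream().filter(this::overlaps).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        TokenRangeFilterSketch filter = new TokenRangeFilterSketch(0, 100);
        System.out.println(filter.overlaps(BigInteger.valueOf(42))); // true
        System.out.println(filter.overlaps(BigInteger.valueOf(-7))); // false
    }
}
```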
sstableTimeRangeFilter
@NotNull public org.apache.cassandra.spark.sparksql.filters.SSTableTimeRangeFilter sstableTimeRangeFilter()
Returns SSTableTimeRangeFilter to filter out SSTables based on min and max timestamp.
- Returns:
- SSTableTimeRangeFilter
-
executorService
protected abstract java.util.concurrent.ExecutorService executorService()
DataLayer implementation should provide an ExecutorService for doing blocking I/O when opening SSTable readers. It is the responsibility of the DataLayer implementation to appropriately size and manage this ExecutorService.
- Returns:
- executor service
-
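One reasonable way to satisfy this contract, sketched with plain JDK concurrency utilities, is a fixed-size pool of named daemon threads sized for blocking I/O. The pool size, thread naming, and daemon setting are assumptions of this sketch; the contract only requires that the implementation size and manage the pool itself:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper producing the kind of ExecutorService an
// executorService() override might return for blocking SSTable I/O.
public class SSTableIoPool {
    static ExecutorService build(int threads) {
        AtomicInteger counter = new AtomicInteger();
        ThreadFactory factory = runnable -> {
            // Named threads make blocking I/O easy to spot in thread dumps
            Thread t = new Thread(runnable, "sstable-io-" + counter.incrementAndGet());
            t.setDaemon(true); // do not block JVM shutdown on idle I/O threads
            return t;
        };
        return Executors.newFixedThreadPool(threads, factory);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = build(4);
        // A submitted task stands in for a blocking SSTable open
        System.out.println(pool.submit(() -> "opened").get());
        pool.shutdown();
    }
}
```

A fixed pool bounds concurrent open file handles per executor; a cached pool would risk unbounded thread growth when many SSTables are opened at once.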
sstables
public abstract org.apache.cassandra.spark.data.SSTablesSupplier sstables(int partitionId, @Nullable org.apache.cassandra.spark.sparksql.filters.SparkRangeFilter sparkRangeFilter, @NotNull java.util.List<org.apache.cassandra.spark.sparksql.filters.PartitionKeyFilter> partitionKeyFilters)
- Parameters:
- partitionId - the partitionId of the task
- sparkRangeFilter - Spark range filter
- partitionKeyFilters - the list of partition key filters
- Returns:
- set of SSTables
-
partitioner
public abstract org.apache.cassandra.spark.data.partitioner.Partitioner partitioner()
-
jobId
public abstract java.lang.String jobId()
- Returns:
- a string that uniquely identifies this Spark job
-
openCompactionScanner
public org.apache.cassandra.spark.reader.StreamScanner openCompactionScanner(int partitionId, java.util.List<org.apache.cassandra.spark.sparksql.filters.PartitionKeyFilter> partitionKeyFilters, org.apache.cassandra.spark.sparksql.filters.SSTableTimeRangeFilter sstableTimeRangeFilter)
-
readIndexOffset
public boolean readIndexOffset()
When true, the SSTableReader should attempt to find the offset into the Data.db file for the Spark worker's token range. This works by first binary searching the Summary.db file to find an offset into the Index.db file, then reading Index.db from that offset to find the first offset in the Data.db file that overlaps with the Spark worker's token range. This enables the reader to start reading from the first in-range partition in the Data.db file and to stop after reading the last. This feature improves scalability as more Spark workers shard the token range into smaller subranges, because it avoids wastefully reading the Data.db file for out-of-range partitions.
- Returns:
- true if the SSTableReader should attempt to read the Summary.db and Index.db files to find the start offset into the Data.db file that overlaps with the Spark worker's token range
-
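The lookup described above can be sketched in miniature: given summary entries of (partition token, Data.db offset) sorted by token, binary search for the first entry at or past the worker's range start and begin reading Data.db there. The arrays, token values, and the `-1` end-of-file sentinel are assumptions of this sketch; real Summary.db/Index.db parsing is omitted:

```java
import java.util.Arrays;

// Hypothetical sketch of the index-offset lookup: tokens[] is the sorted list
// of partition tokens from a summary, offsets[] the matching Data.db offsets.
public class IndexOffsetSketch {
    static long startOffset(long[] tokens, long[] offsets, long rangeStart) {
        int idx = Arrays.binarySearch(tokens, rangeStart);
        if (idx < 0) {
            idx = -idx - 1; // insertion point: first token greater than rangeStart
        }
        // Past the last entry: no partition in range, signal with -1
        return idx < tokens.length ? offsets[idx] : -1;
    }

    public static void main(String[] args) {
        long[] tokens  = {-900L, -100L, 250L, 800L};  // sorted partition tokens
        long[] offsets = {0L, 4096L, 9182L, 20480L};  // matching Data.db offsets
        System.out.println(startOffset(tokens, offsets, 0L));   // 9182
        System.out.println(startOffset(tokens, offsets, 900L)); // -1
    }
}
```

Starting the read at the first in-range offset is what lets each worker's I/O shrink proportionally as the token range is split across more workers.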
useIncrementalRepair
public boolean useIncrementalRepair()
When true, the SSTableReader should only read repaired SSTables from a single 'primary repair' replica and read unrepaired SSTables at the user-set consistency level.
- Returns:
- true if the SSTableReader should only read repaired SSTables on a single 'primary repair' replica
-
openCompactionScanner
public org.apache.cassandra.spark.reader.StreamScanner<org.apache.cassandra.spark.reader.RowData> openCompactionScanner(int partitionId, java.util.List<org.apache.cassandra.spark.sparksql.filters.PartitionKeyFilter> partitionKeyFilters, org.apache.cassandra.spark.sparksql.filters.SSTableTimeRangeFilter sstableTimeRangeFilter, @Nullable org.apache.cassandra.spark.sparksql.filters.PruneColumnFilter columnFilter)
- Returns:
- CompactionScanner for iterating over one or more SSTables, compacting data and purging tombstones
-
openPartitionSizeIterator
public org.apache.cassandra.spark.reader.StreamScanner<org.apache.cassandra.spark.reader.IndexEntry> openPartitionSizeIterator(int partitionId)
- Parameters:
- partitionId - Spark partition id
- Returns:
- a PartitionSizeIterator that iterates over Index.db files to calculate partition size.
-
unsupportedPushDownFilters
public org.apache.spark.sql.sources.Filter[] unsupportedPushDownFilters(org.apache.spark.sql.sources.Filter[] filters)
- Parameters:
- filters - array of push down filters
- Returns:
- an array of push down filters that are not supported by this data layer
-
stats
public org.apache.cassandra.analytics.stats.Stats stats()
Override to plug in your own Stats instrumentation for recording internal events.
- Returns:
- Stats implementation to record internal events
-
-