tfrecord write results in no data but no error #46

@dennisobrien

Hi -- I am trying to use spark-tfrecord with Spark 3.1.2, but the files written have no data.

  • Spark 3.1.2
  • Python 3.8.10
  • Java 1.8.0
  • Scala 2.12.10

I'm using the latest version available from the Maven repository:

<dependency>
    <groupId>com.linkedin.sparktfrecord</groupId>
    <artifactId>spark-tfrecord_2.12</artifactId>
    <version>0.3.4</version>
</dependency>

Following the pyspark example from the README but simplified further:

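from pyspark.sql.types import (
    FloatType, IntegerType, StringType, StructField, StructType,
)

# `spark` is an already-created SparkSession
path = "/tmp/test-output.tfrecord"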

fields = [
    StructField("a", IntegerType()),
    StructField("b", FloatType()),
    StructField("c", StringType()),
]
schema = StructType(fields)
test_rows = [
    [1, 0.5, 'x'],
    [2, 1.5, 'y'],
    [3, 2.5, 'z'],
]
rdd = spark.sparkContext.parallelize(test_rows)
df = spark.createDataFrame(rdd, schema)
df.show()

Outputs:

+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|0.5|  x|
|  2|1.5|  y|
|  3|2.5|  z|
+---+---+---+

Saving the spark dataframe to tfrecord does not throw an error.

path = "/tmp/test-output.tfrecord/"
df.write.mode("overwrite").format("tfrecord").option("recordType", "Example").save(path)

But the directory only has a _SUCCESS flag and a crc file, no data.

ls -la /tmp/test-output.tfrecord/
total 12
drwxr-xr-x.  2 build build 4096 Feb 19 19:00 .
drwxrwxrwx. 11 root  root  4096 Feb 19 19:00 ..
-rw-r--r--.  1 build build    0 Feb 19 19:00 _SUCCESS
-rw-r--r--.  1 build build    8 Feb 19 19:00 ._SUCCESS.crc
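
The same check can be scripted from Python, which may be handy when repeating the test. This is only a minimal sketch using the standard library: `list_data_files` is a hypothetical helper name made up for this report, not part of spark-tfrecord, and it assumes the output directory is on the local filesystem of the machine running the script.

```python
import os


def list_data_files(output_dir):
    """Return sorted names of regular files in output_dir that look like data.

    Bookkeeping entries are skipped: names starting with "_" (such as
    _SUCCESS) or "." (such as ._SUCCESS.crc).
    """
    return sorted(
        name
        for name in os.listdir(output_dir)
        if not name.startswith(("_", "."))
        and os.path.isfile(os.path.join(output_dir, name))
    )
```

For the directory listed above, `list_data_files("/tmp/test-output.tfrecord/")` would return an empty list, since only `_SUCCESS` and `._SUCCESS.crc` are present.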

And of course, trying to read the file fails.

spark.read.format('tfrecord').option('recordType', 'Example').load(path).show()

Error:

AnalysisException: Unable to infer schema for TFRECORD. It must be specified manually.

Let me know if there is more system/config information that could help to debug this.

FWIW, I saw exactly the same behavior when testing spark-tensorflow-connector, which I was building from source. I assumed something was wrong with my dependencies and decided to try this project instead.

thanks,
Dennis
