Read complex/nested xml file in Data Bricks

In this post we will learn reading of complex or nested xml file which has different nodes. For this post we will take example of below xml data.

<?xml version="1.0" encoding="UTF-8"?>
<root>
   <ExportDate>2022-01-26</ExportDate>
   <EmployeeList>
		<Employee>
			<employeeid>1</employeeid>
			<firstname>Ram</firstname>
			<lastname>Kumar</lastname>
			<gender>M</gender>
			<age>25</age>
			<address>
				<addressLine>Block 1</addressLine>
				<city>Delhi</city>
				<state>Delhi</state>
				<country>India</country>
			</address>
		</Employee>
		<Employee>
			<employeeid>2</employeeid>
			<firstname>Manisha</firstname>
			<lastname>Rator</lastname>
			<gender>F</gender>
			<age>32</age>
			<address>
				<addressLine>Z Tower</addressLine>
				<city>Gurugram</city>
				<state>Haryana</state>
				<country>India</country>
			</address>
		</Employee>
   </EmployeeList>
</root>

To read data from this xml into one single table we can use custom schema. In custom schema we will define schema for Address then we will combine Address Schema into Employee schema.

Defining Custom Schema
AddressSchema = StructType([
                    StructField("addressLine", StringType()),
                    StructField("city", StringType()),
                    StructField("state", StringType()),
                    StructField("country", StringType())
                ])

EmployeeSchema = StructType([
                    StructField("employeeid", IntegerType()),
                    StructField("firstname", StringType()),
                    StructField("lastname", StringType()),
                    StructField("gender", StringType()),
                    StructField("age", IntegerType()),
                    StructField("address", AddressSchema )
                ])

By defining schema in this manner we will have tree structure like in xml into data frame and once we have tree structure data in our data frame we can then use explode function to break these object into columns.

Please see below code to read data and the output of the code.

EmployeeDF = spark.read.format("xml") \
    .option("rootTag","root") \
    .option("rowTag", "Employee") \
    .schem a(EmployeeSchema)\
    .load(b)

display(EmployeeDF)
xml read output

Now we have data in our dataframe and we can modify our data frame to read data from address object and can have data in tabular view.

Please find below code to convert this data frame into tabular view data frame.

EmployeeDFNew = EmployeeDF.withColumn("address", explode(array("address"))) \
                .select("employeeid",
                        "firstname",
                        "lastname",
                        "gender",
                        "age",
                        col("address.addressLine"), 
                        col("address.city"), 
                        col("address.state"), 
                        col("address.country")
                )
display(EmployeeDFNew)
Output of tabular view of input xml

This is how we can read data from xml in Data Bricks.

Please let me know your question or feedback in comments.