
PySpark – Sort()

In Python, PySpark is the Spark module used to provide Spark-style processing through DataFrames. In PySpark, sort() arranges the rows of a DataFrame in ascending order by default. It returns a new DataFrame with the rows of the existing DataFrame rearranged. Let's create a PySpark DataFrame.

Example:

In this example, we are going to create a PySpark DataFrame with 5 rows and 6 columns and display it using the show() method.

#import the pyspark module
import pyspark

#import SparkSession for creating a session
from pyspark.sql import SparkSession

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students = [{'rollno': '001', 'name': 'sravan', 'age': 23, 'height': 5.79, 'weight': 67, 'address': 'guntur'},
            {'rollno': '002', 'name': 'ojaswi', 'age': 16, 'height': 3.79, 'weight': 34, 'address': 'hyd'},
            {'rollno': '003', 'name': 'gnanesh chowdary', 'age': 7, 'height': 2.79, 'weight': 17, 'address': 'patna'},
            {'rollno': '004', 'name': 'rohith', 'age': 9, 'height': 3.69, 'weight': 28, 'address': 'hyd'},
            {'rollno': '005', 'name': 'sridevi', 'age': 37, 'height': 5.59, 'weight': 54, 'address': 'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students)

#display the dataframe
df.show()

Output:

Method – 1: Using sort()

Here, we use the sort() function to sort the PySpark DataFrame by one or more columns, passed as column-name strings.

Syntax:

dataframe.sort("column_name", ..., "column_name")

Here,

  1. dataframe is the input PySpark DataFrame.
  2. column_name is the column where sorting is applied.

Example:

In this example, we are going to sort the dataframe based on address and age columns with the sort() function and display the sorted dataframe using the collect() method.

#import the pyspark module
import pyspark

#import SparkSession for creating a session
from pyspark.sql import SparkSession

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students = [{'rollno': '001', 'name': 'sravan', 'age': 23, 'height': 5.79, 'weight': 67, 'address': 'guntur'},
            {'rollno': '002', 'name': 'ojaswi', 'age': 16, 'height': 3.79, 'weight': 34, 'address': 'hyd'},
            {'rollno': '003', 'name': 'gnanesh chowdary', 'age': 7, 'height': 2.79, 'weight': 17, 'address': 'patna'},
            {'rollno': '004', 'name': 'rohith', 'age': 9, 'height': 3.69, 'weight': 28, 'address': 'hyd'},
            {'rollno': '005', 'name': 'sridevi', 'age': 37, 'height': 5.59, 'weight': 54, 'address': 'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students)

# sort the dataframe based on address and age columns
# and display the sorted dataframe
df.sort("address", "age").collect()

Output:

[Row(address='guntur', age=23, height=5.79, name='sravan', rollno='001', weight=67),
Row(address='hyd', age=9, height=3.69, name='rohith', rollno='004', weight=28),
Row(address='hyd', age=16, height=3.79, name='ojaswi', rollno='002', weight=34),
Row(address='hyd', age=37, height=5.59, name='sridevi', rollno='005', weight=54),
Row(address='patna', age=7, height=2.79, name='gnanesh chowdary', rollno='003', weight=17)]

Method – 2: Using sort() with the col Function

Here, we again use the sort() function to sort the PySpark DataFrame by columns, but we specify the column name(s) inside sort() through the col() function, which must be imported from the pyspark.sql.functions module. col() reads a column from the PySpark DataFrame.

Syntax:

dataframe.sort(col("column_name"), ..., col("column_name"))

Here,

  1. dataframe is the input PySpark DataFrame.
  2. column_name is the column where sorting is applied through the col function.

Example:

In this example, we are going to sort the dataframe based on address and age columns with the sort() function and display the sorted dataframe using the collect() method.

#import the pyspark module
import pyspark

#import SparkSession for creating a session
from pyspark.sql import SparkSession

#import the col function
from pyspark.sql.functions import col

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students = [{'rollno': '001', 'name': 'sravan', 'age': 23, 'height': 5.79, 'weight': 67, 'address': 'guntur'},
            {'rollno': '002', 'name': 'ojaswi', 'age': 16, 'height': 3.79, 'weight': 34, 'address': 'hyd'},
            {'rollno': '003', 'name': 'gnanesh chowdary', 'age': 7, 'height': 2.79, 'weight': 17, 'address': 'patna'},
            {'rollno': '004', 'name': 'rohith', 'age': 9, 'height': 3.69, 'weight': 28, 'address': 'hyd'},
            {'rollno': '005', 'name': 'sridevi', 'age': 37, 'height': 5.59, 'weight': 54, 'address': 'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students)

# sort the dataframe based on address and age columns
# and display the sorted dataframe
df.sort(col("address"), col("age")).collect()

Output:

[Row(address='guntur', age=23, height=5.79, name='sravan', rollno='001', weight=67),
Row(address='hyd', age=9, height=3.69, name='rohith', rollno='004', weight=28),
Row(address='hyd', age=16, height=3.79, name='ojaswi', rollno='002', weight=34),
Row(address='hyd', age=37, height=5.59, name='sridevi', rollno='005', weight=54),
Row(address='patna', age=7, height=2.79, name='gnanesh chowdary', rollno='003', weight=17)]

Method – 3: Using sort() with DataFrame Label

Here, we use the sort() function to sort the PySpark DataFrame by columns, specifying each column through the DataFrame's column label, i.e. the dataframe.column_name attribute.

Syntax:

dataframe.sort(dataframe.column_name, ..., dataframe.column_name)

Here,

  1. dataframe is the input PySpark DataFrame.
  2. column_name is the column where sorting is applied.

Example:

In this example, we are going to sort the dataframe based on address and age columns with the sort() function and display the sorted dataframe using the collect() method.

#import the pyspark module
import pyspark

#import SparkSession for creating a session
from pyspark.sql import SparkSession

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students = [{'rollno': '001', 'name': 'sravan', 'age': 23, 'height': 5.79, 'weight': 67, 'address': 'guntur'},
            {'rollno': '002', 'name': 'ojaswi', 'age': 16, 'height': 3.79, 'weight': 34, 'address': 'hyd'},
            {'rollno': '003', 'name': 'gnanesh chowdary', 'age': 7, 'height': 2.79, 'weight': 17, 'address': 'patna'},
            {'rollno': '004', 'name': 'rohith', 'age': 9, 'height': 3.69, 'weight': 28, 'address': 'hyd'},
            {'rollno': '005', 'name': 'sridevi', 'age': 37, 'height': 5.59, 'weight': 54, 'address': 'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students)

# sort the dataframe based on address and age columns
# and display the sorted dataframe
df.sort(df.address, df.age).collect()

Output:

[Row(address='guntur', age=23, height=5.79, name='sravan', rollno='001', weight=67),
Row(address='hyd', age=9, height=3.69, name='rohith', rollno='004', weight=28),
Row(address='hyd', age=16, height=3.79, name='ojaswi', rollno='002', weight=34),
Row(address='hyd', age=37, height=5.59, name='sridevi', rollno='005', weight=54),
Row(address='patna', age=7, height=2.79, name='gnanesh chowdary', rollno='003', weight=17)]

Method – 4: Using sort() with DataFrame Index

Here, we use the sort() function to sort the PySpark DataFrame by columns, specifying each column by its index/position in the DataFrame, written as dataframe[column_index]. In a DataFrame, indexing starts at 0.

Syntax:

dataframe.sort(dataframe[column_index], ..., dataframe[column_index])

Here,

  1. dataframe is the input PySpark DataFrame.
  2. column_index is the column position where sorting is applied.

Example:

In this example, we are going to sort the dataframe based on address and age columns with the sort() function and display the sorted dataframe using the collect() method.

#import the pyspark module
import pyspark

#import SparkSession for creating a session
from pyspark.sql import SparkSession

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students = [{'rollno': '001', 'name': 'sravan', 'age': 23, 'height': 5.79, 'weight': 67, 'address': 'guntur'},
            {'rollno': '002', 'name': 'ojaswi', 'age': 16, 'height': 3.79, 'weight': 34, 'address': 'hyd'},
            {'rollno': '003', 'name': 'gnanesh chowdary', 'age': 7, 'height': 2.79, 'weight': 17, 'address': 'patna'},
            {'rollno': '004', 'name': 'rohith', 'age': 9, 'height': 3.69, 'weight': 28, 'address': 'hyd'},
            {'rollno': '005', 'name': 'sridevi', 'age': 37, 'height': 5.59, 'weight': 54, 'address': 'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students)

# sort the dataframe based on address and age columns
# and display the sorted dataframe
df.sort(df[0], df[1]).collect()

 

Output:

[Row(address='guntur', age=23, height=5.79, name='sravan', rollno='001', weight=67),
Row(address='hyd', age=9, height=3.69, name='rohith', rollno='004', weight=28),
Row(address='hyd', age=16, height=3.79, name='ojaswi', rollno='002', weight=34),
Row(address='hyd', age=37, height=5.59, name='sridevi', rollno='005', weight=54),
Row(address='patna', age=7, height=2.79, name='gnanesh chowdary', rollno='003', weight=17)]

Conclusion

In this article, we discussed four ways to use the sort() function on a PySpark DataFrame in Python: passing column-name strings, col() expressions, DataFrame column labels, and column indices. In every case, sort() returns a new DataFrame ordered by the specified columns.

