PySpark – orderBy()

In Python, PySpark is a Spark module that provides DataFrame-based data processing similar to Spark. In PySpark, the orderBy() function arranges the rows of a DataFrame in sorted order (ascending by default).

It returns a new DataFrame with the rows of the existing DataFrame arranged in the requested order.

Let’s create a PySpark DataFrame.

Example:

In this example, we create a PySpark DataFrame with 5 rows and 6 columns and display it using the show() method.

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students = [{'rollno':'001','name':'sravan','age':23,'height':5.79,'weight':67,'address':'guntur'},
            {'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
            {'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,'address':'patna'},
            {'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
            {'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students)

#display the dataframe
df.show()

Output:

Method – 1: Using orderBy()

Here, we use the orderBy() function to sort the PySpark DataFrame based on one or more columns, passed as column name strings.

Syntax:

dataframe.orderBy("column_name", …, "column_name")

Here,

  1. dataframe is the input PySpark DataFrame.
  2. column_name is the column where sorting is applied.

Example:

In this example, we sort the DataFrame based on the address and age columns with the orderBy() function and retrieve the sorted rows using the collect() method.

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students = [{'rollno':'001','name':'sravan','age':23,'height':5.79,'weight':67,'address':'guntur'},
            {'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
            {'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,'address':'patna'},
            {'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
            {'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students)

# sort the dataframe based on address and age columns
# and return the sorted rows
df.orderBy("address", "age").collect()

Output:

[Row(address='guntur', age=23, height=5.79, name='sravan', rollno='001', weight=67),
Row(address='hyd', age=9, height=3.69, name='rohith', rollno='004', weight=28),
Row(address='hyd', age=16, height=3.79, name='ojaswi', rollno='002', weight=34),
Row(address='hyd', age=37, height=5.59, name='sridevi', rollno='005', weight=54),
Row(address='patna', age=7, height=2.79, name='gnanesh chowdary', rollno='003', weight=17)]
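The ordering above is the usual multi-column sort: rows are compared on address first, and age breaks ties within equal addresses. The same semantics can be sketched in plain Python with sorted() and a tuple key — a minimal illustration of what orderBy("address", "age") computes, not Spark itself, using a trimmed copy of the same student records:

```python
# A trimmed copy of the students data used in the article
students = [
    {'rollno': '001', 'name': 'sravan', 'age': 23, 'address': 'guntur'},
    {'rollno': '002', 'name': 'ojaswi', 'age': 16, 'address': 'hyd'},
    {'rollno': '003', 'name': 'gnanesh chowdary', 'age': 7, 'address': 'patna'},
    {'rollno': '004', 'name': 'rohith', 'age': 9, 'address': 'hyd'},
    {'rollno': '005', 'name': 'sridevi', 'age': 37, 'address': 'hyd'}]

# Tuple key: compare on address first, then on age within equal
# addresses — the same ordering orderBy("address", "age") applies.
ordered = sorted(students, key=lambda row: (row['address'], row['age']))

for row in ordered:
    print(row['address'], row['age'], row['name'])
# guntur 23 sravan
# hyd 9 rohith
# hyd 16 ojaswi
# hyd 37 sridevi
# patna 7 gnanesh chowdary
```

The printed order matches the collect() output above row for row.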

Method – 2: Using orderBy() with the col() Function

Here, we use the orderBy() function to sort the PySpark DataFrame based on the columns. We specify the column name(s) inside orderBy() through the col() function, which must be imported from the pyspark.sql.functions module; it is used to refer to a column of the PySpark DataFrame.

Syntax:

dataframe.orderBy(col("column_name"), …, col("column_name"))

Here,

  1. dataframe is the input PySpark DataFrame.
column_name is the column on which sorting is applied, passed through the col() function.

Example:

In this example, we sort the DataFrame based on the address and age columns with the orderBy() function and retrieve the sorted rows using the collect() method.

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#import the col function
from pyspark.sql.functions import col

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students = [{'rollno':'001','name':'sravan','age':23,'height':5.79,'weight':67,'address':'guntur'},
            {'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
            {'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,'address':'patna'},
            {'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
            {'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students)

# sort the dataframe based on address and age columns
# and return the sorted rows
df.orderBy(col("address"), col("age")).collect()

Output:

[Row(address='guntur', age=23, height=5.79, name='sravan', rollno='001', weight=67),
Row(address='hyd', age=9, height=3.69, name='rohith', rollno='004', weight=28),
Row(address='hyd', age=16, height=3.79, name='ojaswi', rollno='002', weight=34),
Row(address='hyd', age=37, height=5.59, name='sridevi', rollno='005', weight=54),
Row(address='patna', age=7, height=2.79, name='gnanesh chowdary', rollno='003', weight=17)]

Method – 3: Using orderBy() with DataFrame Label

Here, we use the orderBy() function to sort the PySpark DataFrame based on the columns. We specify the column name(s) inside orderBy() through the DataFrame's attribute access, i.e., dataframe.column_name.

Syntax:

dataframe.orderBy(dataframe.column_name, …, dataframe.column_name)

Here,

  1. dataframe is the input PySpark DataFrame.
  2. column_name is the column where sorting is applied.

Example:

In this example, we sort the DataFrame based on the address and age columns with the orderBy() function and retrieve the sorted rows using the collect() method.

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students = [{'rollno':'001','name':'sravan','age':23,'height':5.79,'weight':67,'address':'guntur'},
            {'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
            {'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,'address':'patna'},
            {'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
            {'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students)

# sort the dataframe based on address and age columns
# and return the sorted rows
df.orderBy(df.address, df.age).collect()

Output:

[Row(address='guntur', age=23, height=5.79, name='sravan', rollno='001', weight=67),
Row(address='hyd', age=9, height=3.69, name='rohith', rollno='004', weight=28),
Row(address='hyd', age=16, height=3.79, name='ojaswi', rollno='002', weight=34),
Row(address='hyd', age=37, height=5.59, name='sridevi', rollno='005', weight=54),
Row(address='patna', age=7, height=2.79, name='gnanesh chowdary', rollno='003', weight=17)]

Method – 4: Using orderBy() with DataFrame Index

Here, we use the orderBy() function to sort the PySpark DataFrame based on the columns. We specify the column position(s) inside orderBy() through the DataFrame's index notation, dataframe[column_index]. Indexing starts at 0 and follows the order of the columns in dataframe.columns.

Syntax:

dataframe.orderBy(dataframe[column_index], …, dataframe[column_index])

Here,

  1. dataframe is the input PySpark DataFrame.
  2. column_index is the column position where sorting is applied.

Example:

In this example, we sort the DataFrame based on the address and age columns with the orderBy() function and retrieve the sorted rows using the collect() method.

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students = [{'rollno':'001','name':'sravan','age':23,'height':5.79,'weight':67,'address':'guntur'},
            {'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
            {'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,'address':'patna'},
            {'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
            {'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students)

# sort the dataframe based on address and age columns
# and return the sorted rows
df.orderBy(df[0], df[1]).collect()

Output:

[Row(address='guntur', age=23, height=5.79, name='sravan', rollno='001', weight=67),
Row(address='hyd', age=9, height=3.69, name='rohith', rollno='004', weight=28),
Row(address='hyd', age=16, height=3.79, name='ojaswi', rollno='002', weight=34),
Row(address='hyd', age=37, height=5.59, name='sridevi', rollno='005', weight=54),
Row(address='patna', age=7, height=2.79, name='gnanesh chowdary', rollno='003', weight=17)]
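All four variants sort in ascending order by default. orderBy() also accepts an ascending parameter (a single boolean, or a list with one boolean per column), so a sort such as df.orderBy("address", "age", ascending=[True, False]) keeps address ascending while reversing age within each address. The resulting order can be sketched in plain Python — an illustration of the semantics only, not Spark itself, and the negated-key trick shown works only because age is numeric:

```python
# Mimic df.orderBy("address", "age", ascending=[True, False]):
# address ascending, age descending within each address.
students = [
    {'rollno': '001', 'age': 23, 'address': 'guntur'},
    {'rollno': '002', 'age': 16, 'address': 'hyd'},
    {'rollno': '003', 'age': 7, 'address': 'patna'},
    {'rollno': '004', 'age': 9, 'address': 'hyd'},
    {'rollno': '005', 'age': 37, 'address': 'hyd'}]

# Negating the numeric age flips its sort direction while the
# string address still sorts ascending.
ordered = sorted(students, key=lambda row: (row['address'], -row['age']))

print([row['rollno'] for row in ordered])  # ['001', '005', '002', '004', '003']
```

The three 'hyd' rows now come out oldest first (sridevi, ojaswi, rohith), while the addresses still run guntur, hyd, patna.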

Conclusion

In this article, we discussed four ways to use the orderBy() function on a PySpark DataFrame in Python: passing column names as strings, wrapping them in the col() function, using DataFrame column labels, and using column indices. In each case, orderBy() returns a new DataFrame sorted on the given columns.

