Combining data with merge(), join(), and concat() methods in Pandas
The Pandas library provides Series and DataFrame objects, two powerful tools for exploring and analyzing data. Whether your data has a few dozen rows or millions, you can load it into a DataFrame for analysis.
As a basic idea, a Series can be thought of as a single column, and a DataFrame as a collection of Series arranged in a table-like structure. The data for an analysis often arrives as separate datasets, so it sometimes becomes necessary to combine them. This is where Pandas methods like merge(), join(), and concat() come in handy.
This post discusses those methods and talks about how to use them in our analysis process.
Combine dataframes on common columns or indexes
Pandas' merge() offers join operations just like the ones you see in databases. In a relational database, you can combine tables on one or more key columns, and that's precisely what merge() does in Pandas. When you need to join two DataFrames based on one or more keys, merge() is your go-to method; to combine more than two, chain merge() calls.
merge() requires two datasets: the left DataFrame and the right DataFrame. Along with these two required arguments, you can also use other parameters:
- how: The type of merge to apply: left, right, inner, or outer. The default is inner.
- on: Column or index level names to merge on; they must be present in both DataFrames
- left_on/right_on: Column or index level names to join on in the left/right DataFrame
- left_index/right_index: Use the index from the left/right DataFrame as the join key
- suffixes: Suffixes appended to overlapping column names from the left and right DataFrames; the default is ("_x", "_y")
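The parameters above can be sketched with a small example. The employees and departments tables, their column names, and their values are made up for illustration and do not come from any real dataset.

```python
import pandas as pd

# Hypothetical sample data, invented for this illustration.
employees = pd.DataFrame({
    "emp_id": [1, 2, 3, 4],
    "name": ["Ana", "Ben", "Cara", "Dev"],
    "dept_id": [10, 20, 10, 40],
})
departments = pd.DataFrame({
    "dept_id": [10, 20, 30],
    "name": ["Sales", "Ops", "Legal"],
})

# Both tables have a "name" column, so suffixes tells them apart.
suffixes = ("_emp", "_dept")

# Default (inner) merge: keeps only dept_id values found in both tables.
inner = pd.merge(employees, departments, on="dept_id", suffixes=suffixes)

# Left merge: keeps every employee; missing department data becomes NaN.
left = employees.merge(departments, how="left", on="dept_id", suffixes=suffixes)

# Outer merge: keeps the keys from both tables.
outer = employees.merge(departments, how="outer", on="dept_id", suffixes=suffixes)

# left_on/right_on: use these when the key columns have different names.
departments_renamed = departments.rename(columns={"dept_id": "id"})
by_names = employees.merge(
    departments_renamed, left_on="dept_id", right_on="id", suffixes=suffixes
)

print(inner)
print(left)
print(outer)
print(by_names)
```

With this sample data, the inner merge drops the employee whose dept_id has no match, the left merge keeps that employee with a missing department name, and the outer merge also keeps the department that has no employees.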
Combine dataframes on a key column or index
The key difference between merge() and join() is that merge() is available as a top-level function (pandas.merge()) as well as a DataFrame method, while join() exists only as a DataFrame method. You pass join() the other DataFrame, and it is joined to the DataFrame you call .join() on.
Also, join() uses merge() under the hood, but it is a more convenient way to combine DataFrames on their indexes. Like merge(), join() takes some extra parameters:
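- other: The other DataFrame to join
- on: Column or index level name in the calling DataFrame to match against the index of other
- how: The join type, just like in merge(), but the default is left. join() is index-based, but if you specify columns in the on parameter, those columns of the calling DataFrame are matched against the index of other
- lsuffix/rsuffix: Suffix added to overlapping columns from the left/right DataFrame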
left_dataframe.join(right_dataframe, lsuffix="_left", rsuffix="_right")
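As a minimal sketch of how the call above behaves, the example below builds two small DataFrames and joins them. The data, index labels, and the orders table are hypothetical and chosen only to show the index-based matching.

```python
import pandas as pd

# Hypothetical sample data, invented for this illustration.
left_dataframe = pd.DataFrame({"value": [1, 2, 3]}, index=["a", "b", "c"])
right_dataframe = pd.DataFrame({"value": [10, 30, 40]}, index=["a", "c", "d"])

# Index-on-index join. The default is a left join, so every row of
# left_dataframe is kept and unmatched rows get NaN on the right side.
joined = left_dataframe.join(right_dataframe, lsuffix="_left", rsuffix="_right")

# An inner join keeps only the index labels present in both DataFrames.
inner = left_dataframe.join(
    right_dataframe, how="inner", lsuffix="_left", rsuffix="_right"
)

# With on, a column of the calling DataFrame is matched against
# the index of the other DataFrame.
orders = pd.DataFrame({"key": ["a", "c", "a"], "qty": [5, 6, 7]})
with_on = orders.join(right_dataframe, on="key")

print(joined)
print(inner)
print(with_on)
```

Both DataFrames share a column called value, so without lsuffix and rsuffix the first two calls would raise an error about overlapping columns.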
Combine dataframes across columns or rows
concat() works much like string concatenation. Whether along rows or columns, it appends DataFrames to one another. Unlike merge() or join(), concat() just sticks the DataFrames together along an axis, which can be the row axis or the column axis.
result = pandas.concat([df1, df2])
By default, when no axis is specified, the concatenation happens along the row axis.
result = pandas.concat([df1, df2], axis=1)
The above concatenation will be based on the column axis. Here are some other parameters you can give in the concat() method.
- objs: The sequence of DataFrames (or Series) to concatenate
- axis: The axis to concatenate along; 0 for rows and 1 for columns
- join: Accepts inner or outer, and controls how the labels on the other axis are handled. The default is outer.
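The effect of axis and join can be sketched as follows. The df1 and df2 contents are made-up sample data, and ignore_index is an additional concat() parameter not covered in this post, included here only because the default result repeats index labels.

```python
import pandas as pd

# Hypothetical sample data: the frames share column "a" but not "b" or "c".
df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pd.DataFrame({"a": [5, 6], "c": [7, 8]})

# Row-wise (axis=0) with the default outer join: all columns are kept,
# and cells with no data are filled with NaN.
rows = pd.concat([df1, df2])

# The original index labels (0, 1, 0, 1) are kept by default;
# ignore_index=True renumbers the rows instead.
rows_reset = pd.concat([df1, df2], ignore_index=True)

# join="inner" keeps only the columns that appear in every DataFrame.
rows_inner = pd.concat([df1, df2], join="inner")

# Column-wise (axis=1): frames are placed side by side, aligned on the index.
cols = pd.concat([df1, df2], axis=1)

print(rows)
print(rows_reset)
print(rows_inner)
print(cols)
```

Note that the column-wise result contains two columns named a, because concat() does not rename overlapping columns the way merge() and join() do with suffixes.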
This post discussed different techniques for combining datasets. The Pandas library offers three functions, merge(), join(), and concat(), which help in joining or combining different datasets for exploration and analysis.