STATA training manual 1

What is Stata?

Stata, pronounced “stay-tuh” (or “stah-tah” by many Malawians), is a robust statistical software package that offers advanced data management capabilities, a comprehensive selection of modern statistical methods, and exceptional tools for creating publication-quality graphs. The name Stata is derived from a variation of the word "Statistics" rather than being an abbreviation. Stata is known for its speed and user-friendly interface, and this training course explores it thoroughly.

Why Use Stata?

There are numerous comparable statistical packages, such as SPSS, R, SAS, Matlab, and EViews.

When considering whether to use Stata, it's essential to recognize its key strengths: the ability to efficiently handle and manipulate large datasets, including millions of observations, and its expanding capabilities for panel and time-series regression analysis.

Stata 13, released in 2013, boasts significant improvements in computing speed, capabilities, and functionality, along with enhanced graphics options. Users with specific needs frequently write and share add-on programs, so custom programs for various regression analyses are easy to obtain and integrate seamlessly. One user noted that Stata feels "very interactive," allowing for a conversational experience in which the software responds precisely to user commands, and preferred it over other statistical packages for that reason.

Types of Stata

Stata comes in four editions: Stata/MP (Multi-Processor), the most powerful option, followed by Stata/SE (Special Edition), Stata/IC (Intercooled), and Small Stata. The key differences among these editions lie in their maximum capacity for handling variables, regressors, and observations. Understanding these differences is crucial for making an informed purchasing decision, especially when advising your organization on which edition to acquire. The table below summarizes the characteristics of each edition.

Edition       Max. variables   Max. regressors   Max. observations   Processor use
Stata/MP      32,767           10,998            2,147,483,647*      Runs on multiple CPUs or cores (2 to 64, depending on the licence) but can also run on a single core; fastest version of Stata
Stata/SE      32,767           10,998            2,147,483,647*      Runs on a single core; can run on multi-core computers but uses only one core
Stata/IC      2,047            798               2,147,483,647*      Runs on a single core
Small Stata   99               99                1,200               Runs on a single core

*Assuming you have enough memory

In this Training Course, we are going to use Stata/SE (Version 13).

What can Stata do?

Stata is primarily a command-driven software package, and while the latest versions offer pull-down menus for selecting various commands, the most effective way to learn Stata remains through typing commands directly.

Typing commands makes econometric and statistical work in Stata more interactive and flexible than using pull-down menus. To fully leverage Stata's capabilities, it is recommended to type commands directly, although mastering the exact syntax can be challenging. In such instances, you can execute a command via the menu and then copy the generated syntax from the Results or Review window. Additionally, the help command provides valuable guidance on command syntax, facilitating a smoother learning process.

This section provides an overview of the Stata interface and its capabilities, offering a glimpse into the software's functionality. We will explore various tasks using both menus and commands, allowing you to become acquainted with Stata's features and operations. This introductory approach will help you understand the basics of what Stata can accomplish.

Appendices 2a and 2c summarize some of the basic introductory features one needs to know (most, if not all, of them already presented above).

The Stata Interface

Stata Windows

Stata features five primary windows: Review, Results, Command, Variables, and Properties, with all but the Results window displaying their names in the title bar. These windows are consistently in use throughout your Stata session, alongside specialized windows like the Viewer, Data Editor, Variables Manager, Do-file Editor, Graph, and Graph Editor. The Command window is where you submit commands; it offers basic text editing, copying and pasting, and a command history for recalling and editing previously submitted commands. The Results window displays all commands and their textual results from the session, and you can scroll through it or search it for specific output.

To search the Results window, use the find bar, which is hidden by default. You can reveal it by selecting Edit > Find. You can also clear the Results window at any time by right-clicking in the Results window and selecting Clear Results from the contextual menu.

The Review window provides a history of entered commands, displaying successful commands in black and unsuccessful ones in red, along with their error codes. Users can toggle the visibility of tools using the filter button in the title bar, and the "Filter commands here" field allows for filtering of commands based on entered text, with case sensitivity ignored by default. The wrench icon enables customization of the filtering behavior, while the exclamation mark button can hide commands that encountered errors. Importantly, no commands are deleted; only their visibility is affected.

To execute a command from the Review window, click once on a past command to copy it to the Command window, or double-click to resubmit it; executed commands appear at the bottom of the Review window. Right-clicking in the Review window offers a menu for various actions.

The Variables window displays all variables in the dataset along with their properties; the columns shown can be customized by right-clicking on the column headers. Users can select multiple variables by either Ctrl-clicking on nonadjacent variables or using Shift-click to select a range of variables. Double-clicking a variable inserts it at the current position in the Command window. Additionally, the Variables window offers options for filtering and reordering the displayed variables.

In the Variables window, you can reorder variables by clicking on any column header. Right-clicking on a variable reveals a helpful menu. The Properties window shows the properties of the selected variable or dataset; if one variable is chosen, its specific properties are displayed, while multiple selected variables will show only the properties they have in common.

To open any window or to reveal a window hidden by other windows, select the window from the Window menu, or select the proper item from the toolbar.

The Tool Bars

The toolbar contains buttons that provide quick access to Stata’s more commonly used features.

If you're unsure about the function of a button, simply hover your mouse pointer over it to reveal a tooltip that describes its purpose. Buttons featuring both an icon and an arrow will open a menu when you click on the arrow. Below is a summary of the toolbar buttons and their respective functions.

Open: opens a Stata dataset. Click on the button to open a dataset with the Open dialog.

Save: saves the Stata dataset currently in memory to disk.

Print: displays a list of windows. Select a window name to print its contents.

Log: begins a new log or closes, suspends, or resumes the current log.

Viewer: opens the Viewer or brings it to the front of all other windows. Click on the button to open the Viewer; click on the arrow to select a specific Viewer to bring to the front.

Graph: brings the Graph window to the front of all other windows. Click on the arrow to select a specific Graph window to bring to the front.

Do-file Editor: opens the Do-file Editor or brings it to the front of all other windows. Click on the button to open a new Do-file Editor; click on the arrow to select an existing Do-file Editor window to bring to the front.

Data Editor (Edit): opens the Data Editor or brings the Data Editor to the front of the other Stata windows.

Data Editor (Browse): opens the Data Editor in browse mode.

Variables Manager: opens the Variables Manager.

Clear more Condition: tells Stata to continue when it has paused in the middle of long output.

Break: stops the current task in Stata.

Menus and dialogs

Stata offers two primary methods for executing commands: through menus and dialogs or via the Command window. Users can access nearly all Stata commands using the point-and-click features found in the Data, Graphics, and Statistics menus. For example, instead of typing the regress command, you can select it from the menus:

Statistics > Linear models and related > Linear regression

The dialog for the regress command offers comprehensive access to its functionality. When using the dialog for the first time, it is worth exploring each tab to see its full range of capabilities.

The dialogs for many commands have the by/if/in and Weights tabs. These provide access to Stata’s commands and qualifiers for controlling the estimation sample and dealing with weighted data.

When a command is issued through a dialog in Stata, it is processed as if it had been typed manually. The command can be viewed in both the Results window and the Review window after execution, allowing users to examine the complete command and improve their understanding of Stata's command syntax.

You can also open a command's dialog without going through the menus. If you know the name of a Stata command but can't remember its menu location, type "db commandname" to open the corresponding dialog. For instance, typing "db regress" will launch the dialog box for the regress command.
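As a minimal sketch of the typed-command route, the lines below run the same kind of linear regression that the dialog would build. The auto dataset and the variables price, mpg, and weight are assumptions for illustration only (auto is an example dataset shipped with Stata, not part of this course's data).

```stata
* Load an example dataset shipped with Stata (assumed here for illustration)
sysuse auto, clear

* Typed equivalent of Statistics > Linear models and related > Linear regression:
* regress the dependent variable (price) on two regressors (mpg and weight)
regress price mpg weight
```

Running the same model once from the dialog and comparing the command echoed in the Review window with the line above is a quick way to check your understanding of the syntax.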

How to load your dataset from disk and save it to disk

Reading Data into Stata

Stata offers various methods for loading data. The first three methods below are for datasets already in Stata format, while the fourth accommodates data in other formats.

1 Double-click the Stata data file to load it into Stata.

2 Go to the "File" menu and select "Open." In the "Open" window, locate the file on your computer, then either click on the file and select "Open" or simply double-click the file.

3 Use the "Open" icon on the toolbar to do the same as in 2 above.

4 To import files in other formats, go to the "File" menu, select "Import," and then choose your dataset's format. An import dialog box will appear, allowing you to select the appropriate options before clicking "OK." For instance, the dialog box for importing Excel files is displayed below.

You can also load data from the Command window, using the use command for Stata-format files and the import and insheet commands for other data formats.

Examples:

a use "C:\Users\user\Desktop\data_phd\New data2\responserate.dta", clear

The clear option will clear any dataset currently in memory before opening the new dataset.

b import excel "C:\Users\user\Desktop\data_phd\New data2\responserate.xls", sheet("Sheet1") firstrow

The sheet("Sheet1") option specifies that the data are in the worksheet named Sheet1, while the firstrow option indicates that the first row contains the variable names.

c insheet using "C:\Users\user\Desktop\data_phd\New data2\responserate.csv"

Note: The insheet command reads comma- or tab-delimited text files such as csv files. As of Stata 13 it has been superseded by import delimited, although it continues to work.

Stata can hold only one dataset in memory at a time, so opening a new dataset discards the one currently open. If the current dataset has unsaved changes, Stata will not discard it unless you force the action. When you open a file by any method other than the Command window, a prompt will appear; when you use the Command window with modified data in memory, you will get an error message.

These behaviors protect you from mistakenly losing data
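The loading commands and the protective behavior described above can be tried with the following sketch. It assumes Stata's bundled auto dataset as a stand-in for your own data, and every file name in it is hypothetical; import delimited is used for the csv file because it is the Stata 13 replacement for insheet.

```stata
* Assumption: the bundled auto dataset stands in for your own data,
* and the file names below are hypothetical.
sysuse auto, clear
save "auto_copy.dta", replace

* Write the same data out in two other formats so there is something to import
export excel using "auto_copy.xls", firstrow(variables) replace
export delimited using "auto_copy.csv", replace

* (a) Stata-format file
use "auto_copy.dta", clear

* (b) Excel file; the first row holds the variable names
import excel "auto_copy.xls", sheet("Sheet1") firstrow clear

* (c) Comma-separated text file
import delimited "auto_copy.csv", clear

* Unsaved changes block a plain use; capture traps the error instead of stopping
replace price = price + 1 in 1
capture use "auto_copy.dta"
display _rc                      // nonzero: Stata refused to discard the data

* Adding clear forces the load and discards the unsaved change
use "auto_copy.dta", clear
```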

Saving data in Stata

To save an unnamed dataset (or an old dataset under a new name):

1 select File > Save As; or

2 type save filename in the Command window.

To save a dataset for use with Stata 11 or Stata 12 (Stata 11 can load Stata 12 datasets),

1 select File > Save As, and select Stata 12 Data (*.dta) from the Save as type list; or

2 type saveold filename in the Command window.

To save a dataset that has been changed (overwriting the original data file),

1 select File > Save;

2 click on the Save button; or

3 type save, replace in the Command window.

Overwriting a dataset makes it impossible to recover the original data, so it’s crucial either to keep a backup of the original filename.dta or to save changes under a new name. This is similar to working with a word-processing document, but recovering from an inadvertently overwritten dataset is far more difficult than recovering a text file.

It's essential to remember that any modifications made to a dataset are temporary until saved; you are working with an in-memory copy rather than the actual data file. This is how most computer applications operate. If you decide not to save your changes, you can clear the current dataset in memory and load a new one by entering the command "use filename, clear."

To save a dataset for the first time

1 select File > Save As and use the Save As dialog box to complete the saving.
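The saving commands above can be sketched as follows. The bundled auto dataset, the file name mycars.dta, and the variable price_k are assumptions made for illustration.

```stata
* Assumption: the bundled auto dataset and the file name are illustrative only
sysuse auto, clear

* Save the data under a new name in the current working folder
save "mycars.dta", replace

* Change the data in memory, then overwrite the file on disk with the changes
generate price_k = price / 1000
save, replace

* Make a change you do not want to keep ...
drop mpg

* ... and discard it by reloading the saved file
use "mycars.dta", clear
describe price_k mpg
```

Because drop mpg was never saved, both price_k and mpg are present after the final use command.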

Getting Help in Stata

Manuals

Stata offers comprehensive User Manuals that guide users in utilizing the software effectively. In addition to the main manual, there are specialized manuals for various topics, including the Graphics Manual, Panel Data Manual, and Survey Data Manual. Users can obtain these manuals in print or access them digitally through the help system within Stata.

Stata In-Built Help and Website

Stata includes a built-in abbreviated manual accessible through the Help menu, which outlines major topics. Additionally, users can find a valuable Frequently Asked Questions (FAQ) section on Stata's website at http://www.stata.com/support/faqs/. For the syntax and description of a specific command, type `help [topic]` in the command line.

Stata also provides a comprehensive training program, which we will demonstrate during this course. Additionally, the official website offers valuable resources for mastering Stata, including a curated list of helpful links for users.

The Web

The internet is an invaluable resource for problem-solving, and Google is a key tool in this process. Stata maintains a comprehensive website at [stata.com](http://www.stata.com), where users can access all official datasets online, among other valuable resources.

Stata provides further resources for users, including the Stata Journal and a well-stocked bookstore featuring texts on Stata and related statistical topics. It offers online training through NetCourses, which can be accessed via email and the web. Additionally, there are numerous chat rooms dedicated to Stata commands, and many authors share new programs on their personal websites.

There is a dedicated listserv at Harvard School of Public Health for posting questions and receiving quick, knowledgeable responses from users. To join, visit [Statalist](http://www.stata.com/statalist/) for subscription instructions. The discussions are archived by Stata, Harvard University, and Yahoo.

UCLA offers a comprehensive Stata portal at http://www.ats.ucla.edu/stat/stata/, featuring a variety of resources to enhance your Stata skills. Key offerings include a starter kit with "class notes with movies," instructional materials that blend written content with online videos, and topic-specific links for practical guidance on common tasks. Additionally, the portal provides advanced learning modules, some accompanied by videos, as well as comparisons of Stata with other statistical software like SAS and SPSS.

Colleagues

Engaging with colleagues who have expertise in Stata can be incredibly beneficial for your learning process. Often, we may invest hours or even days attempting to solve a problem, unaware that assistance is just a phone call or a short walk away. Don't hesitate to reach out for help; collaboration can significantly enhance your understanding and efficiency.

Stata Documentation: Keeping Track of Things

Do-file

Stata features a built-in text editor known as the Do-file Editor, designed for various tasks. It takes its name from the "do-file," a file that contains a sequence of commands for Stata to execute, similar to a batch file or script in other programming environments.

The Do-file Editor offers advanced features for writing and organizing commands, allowing users to compile a series of commands that Stata then executes in sequence as a single batch. This is particularly useful for creating loops to process multiple variables consistently or for handling complex, repetitive tasks efficiently.

The Do-file Editor can be opened either by clicking on the Do-file Editor button or by typing doedit in the Command window.

Do-files have a toolbar as shown below:

New: Open a new do-file in a new tab in the Do-file Editor

Open: Open a do-file from disk in a new tab in the Do-file Editor

Save: Save the current do-file to disk

Print: Print the contents of the Do-file Editor

Find: Open the Find dialog for finding text

Cut: Cut the selected text and put it in the Clipboard

Copy: Copy the selected text to the Clipboard

Paste: Paste the text from the Clipboard into the current document

Undo: Undo the last change

Redo: Undo the last undo

Toggle Bookmark: Turn a bookmark on or off on the current line. Bookmarks let you move quickly within the do-file, which is especially useful for long do-files or when debugging.

Previous Bookmark: Go to the previous bookmark (if any)

Next Bookmark: Go to the next bookmark (if any)

Show File in Viewer: Show the contents of the do-file in a Viewer window. This is useful when editing files that contain SMCL tags, such as log files and help files.

Execute (do): Run the commands in the do-file, showing all commands and their output. When text is highlighted, the button becomes Execute Selection (do) and runs only the selected lines, again showing all output. We will refer to this as the Do button.

In addition to directly editing and saving a do-file, Stata allows users to send highlighted commands or the entire content from the Review window to the Do-file Editor for interactive work.

The diagram below shows what a do-file looks like:

In Stata 11 and later, the text color changes while you type; this is the Do-file Editor's syntax highlighting. You can customize the colors and text properties of syntax elements by selecting Edit > Preferences and choosing the Syntax Color tab. Some common default features of a do-file are:

The green color signifies a user-written description of a procedure. To document any action for future reference, simply precede your statement, word, or phrase with an asterisk (*).

To describe something on the same line as a command, you can use double slashes (//). However, be cautious, as three slashes (///) serve a different purpose: they tell Stata that the command continues on the next line, a technique that may not be covered in this course.

 The blue color is for a command while words between inverted commas (“ ”) are red.
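A minimal do-file illustrating these comment styles might look like the sketch below. The bundled auto dataset and its variables are assumptions for illustration; note that // and /// are meant for do-files and are not recognized when typed directly into the Command window.

```stata
* A line starting with an asterisk is a comment describing the next step
sysuse auto, clear              // two slashes start a comment after a command

* Three slashes tell Stata that the command continues on the next line
summarize price mpg weight ///
    if foreign == 1
```

Pasting these lines into the Do-file Editor also shows the default colors: green for the comments, blue for the commands.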

Using logs

To ensure your analysis is easily replicable, it's essential to keep a detailed lab notebook, as a bench scientist does. While the intense focus of your work may give you a sense of complete understanding, this feeling is often short-lived, and crucial details may fade by the next day. Fortunately, Stata provides a built-in log file that serves as a digital lab notebook, helping you capture everything needed to duplicate your work exactly.

A log file is a comprehensive record of your Results window, capturing all commands and textual output in real time, so your work is documented as you progress. Because the log file is written to disk at the same time as output appears in the Results window, it also protects against loss of work due to power failures or computer crashes. It is therefore advisable to start a log file at the beginning of any significant work in Stata.

A log file can be saved in two formats. By default, it is saved in Stata Markup and Control Language (SMCL) format, which preserves all formatting and links and can be viewed much like the Results window. Alternatively, users can opt for a plain-text log file without formatting. The SMCL format is recommended, as it can be translated into various formats compatible with other applications via the File > Log > Translate menu.

To start a log file, click the Log button, which will open a standard file dialog. This allows you to choose a directory and filename for your log. If you omit the file extension, .smcl will be added automatically.

Clicking the Log button while a log file is open prompts you to choose whether to view a snapshot of the log, suspend it, or close it.

If you choose an existing log file, you will be asked whether you want to view, append to, or overwrite the existing log file.

You can view the log file using the Viewer window in two ways:

To print a standard SMCL log file, first open it in a Viewer window. Then click the Print button, right-click the Viewer window and choose Print, or select File > Print. This opens a Print dialog, and after you select Print, an Output Settings dialog will appear.
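The same logging steps can be carried out with typed commands, as in the sketch below. The log file names and the bundled auto dataset are assumptions for illustration.

```stata
* Close any log left open from earlier work (capture hides the error if none is open)
capture log close

* Start a log in the current working folder; the file name is hypothetical
log using "session1.smcl", replace

sysuse auto, clear
summarize price

log off                          // suspend logging
log on                           // resume logging
log close                        // close the log file

* Translate the SMCL log into a plain-text copy
translate "session1.smcl" "session1.log", replace

* Open the SMCL log in a Viewer window
view "session1.smcl"
```

The replace option overwrites an existing log of the same name; use append instead to add to it.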

Defining Stata Working Folder

Stata operates within a single predefined folder during each working session; if not specified, it defaults to the installation folder, typically located in My Documents. Over time, this can result in the accumulation of numerous files when using Stata.

Properly storing files is crucial for easy access later. One significant benefit of pre-defining a working folder is that all files generated during a session are automatically saved there. Therefore, it is highly recommended to set the working folder's path for each session using the cd (change directory) command. For instance, if we have created a working folder called "Stata Training" in "My Documents" on drive D, our working folder is defined by giving the whole route to "Stata Training" as follows:

cd "d:\My Documents\Stata Training"

To minimize typing errors, especially when dealing with long folder paths, it is advisable to copy the folder route directly from File Explorer and paste it into the Command window or do-file. This helps ensure accuracy, as Stata is particularly sensitive to even minor typing mistakes.
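A short sketch of setting the working folder from a do-file follows. The folder name stata_training and the saved file are hypothetical, and the folder is created under whatever folder Stata is currently using.

```stata
* Show the current working folder
pwd

* Create a project folder if it does not yet exist (the name is hypothetical)
capture mkdir "stata_training"

* Make it the working folder for this session
cd "stata_training"

* Files saved without a path now land in this folder
sysuse auto, clear
save "auto_copy.dta", replace
dir
```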

The Data Editor

The Data Editor provides a spreadsheet-style interface for viewing and managing data currently in memory, allowing users to enter new information, modify existing entries, and adjust data attributes like variable names, labels, and display formats Additionally, it features two windows for variable manipulation: the Variables window and the Properties window, which function similarly to their counterparts in the main Stata interface.

Each action you take in the Data Editor issues a command to Stata as if you had typed it in the Command window, which lets you keep accurate records of your work and learn the commands as you go.

The Data Editor in Stata allows you to maintain a live view of your dataset while you work, enhancing your workflow To safeguard your data from unintentional modifications, the Data Editor features two modes: edit mode for making changes and browse mode for simply viewing the data In browse mode, editing is disabled, ensuring the integrity of your dataset.

It is highly recommended that you use the Data Editor in browse mode and switch to edit mode only when you want to make changes.

PRACTICAL SESSION 1: Exercises on Syntaxes

You will be provided with dataset(s) to gain hands-on experience with the Stata environment. Among other things, you will be guided to create your own do-files and log files, define a working folder, and open, import, and save data files.

The toolbar for the Data Editor has some standard buttons and some buttons we have not yet seen:

Open: Opens a Stata dataset. Stata will warn you if your current dataset has unsaved changes

Save: Saves the dataset visible in the Data Editor

Copy: Copies the current selection to the Clipboard

Paste: Pastes the contents of the Clipboard. The selected cell serves as the upper-left corner of the pasted data. Be cautious, as pasting overwrites any existing data in the cells it covers.

Edit Mode: Changes the Data Editor to edit mode

Browse Mode: Changes the Data Editor to browse mode for safely looking at data

Filter Observations: Filters the observations visible in the Data Editor. This button is useful for looking at a subset of the current dataset

Right-clicking in the Data Editor opens a contextual menu for data manipulation and viewing options. This menu provides access to various common tasks within the Data Editor window.

The Data Editor in browse mode allows users to view data without making any alterations, so stray keystrokes cannot affect the dataset. To access this mode, click the Data Editor (Browse) button or type "browse" in the Command window. In browse mode, all options that could modify data, labels, or display formats are disabled, although users can still view a variable’s properties through the Variable Properties menu. Filtering observations and hiding variables are still permitted, as these actions do not change the underlying dataset.

Even when the Data Editor is in browse mode, you can still use the Stata menus and type commands in the Command window to modify your dataset. This allows you to observe the effect of your commands on the data, although direct changes through the Data Editor are not permitted.

Variables Manager

The Variables Manager is a tool for managing properties of variables both individually and in groups. It can be used to create variable and value labels, rename variables, change display formats, and manage notes. It has the ability to filter and group variables as well as to create variable lists.

The Variables Manager offers valuable features for efficiently managing large datasets, as each action triggers a command in Stata, similar to entering it in the Command window This functionality not only helps maintain accurate records but also facilitates learning Stata commands through practical use of the Variables Manager.

You open the Variables Manager by selecting Data > Variables Manager or clicking on the Variables Manager button.

Labelling Data

Naming variables

When naming variables in Stata, keep in mind that while names can be up to 32 characters long, many commands display only the first 12 characters, making shorter names more practical. Stata is also case sensitive, meaning "Age" and "age" are treated as distinct variables. For clarity and efficiency, adopt a consistent naming convention that favors short, lowercase names or abbreviations, such as "effort" or "fpe" instead of "family_planning_effort" or "familyPlanningEffort." Using underscores to separate words can also improve readability while keeping names within the legal limits.

Labeling variables

Variables can be labeled using the following Stata syntax:

label variable var1 "description"

where var1 is the variable to be labeled and description is the label of var1.

Dialog Box: Data > Variables Manager

Example: Label a variable called nation_maize as total maize production in the nation:

label variable nation_maize "total maize production in the nation"

Labeling the various levels of a categorical variable

Categorical variables are labeled in two steps. First, define the labels for the categories:

label define var1 1 "name of the first category" 2 "name of the second category"

Second, attach those labels to the variable:

label values var1 var1

where var1 is the name of the categorical variable (used here also as the name of the set of labels) and 1 and 2 are the levels of the categorical variable.

Example: In a dataset, the variable gender has two categories: 1 represents male and 2 represents female. To label these categories, type

label define gender 1 "male" 2 "female"

label values gender gender
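The two kinds of label can be tried on a small made-up dataset, as sketched below. The three observations and the id variable are invented purely for illustration; note that the labels are enclosed in straight double quotes.

```stata
* A tiny made-up dataset (values are invented for illustration)
clear
input id gender nation_maize
1 1 120
2 2 95
3 1 140
end

* Describe what the variable measures
label variable nation_maize "total maize production in the nation"

* Step 1: define a set of value labels; step 2: attach it to the variable
label define gender 1 "male" 2 "female"
label values gender gender

* The labels now appear in place of the codes 1 and 2
tabulate gender
describe
```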

Generating new variables from existing variables(s)

The two most common commands for creating new variables are generate and egen depending on the definition of the new variable

Syntax: generate new_variable = expression

Example: Generate a variable called income, which is the sum of farm income (fincome) and non-farm income (nfincome): generate income = fincome + nfincome

Dialog: Data > Create or change data > Create new variable

There are some details you should know about the generate command:

1 You will get an error message if you try to generate a variable that already exists

2 An algebraic calculation using a missing value yields a missing value, as does division by zero, the square root of a negative number, or any other computation which is impossible

3 If missing values are generated, the number of missing values in the new variable is always reported. If Stata says nothing about missing values, then no missing values were generated
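The three details above can be seen in the following sketch, which uses a small made-up dataset (the values and the variable share are invented for illustration).

```stata
* A tiny made-up dataset: row 2 has a missing value, row 3 has a zero
clear
input fincome nfincome
100 50
200 .
80 0
end

* Detail 2: a calculation involving a missing value yields a missing value
* Detail 3: Stata reports how many missing values were generated
generate income = fincome + nfincome

* Detail 1: generating a variable that already exists is an error;
* capture traps the error so that the do-file keeps running
capture generate income = 0
display _rc                      // nonzero return code: income already exists

* Detail 2 again: division by zero also yields a missing value
generate share = fincome / nfincome
list
```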

The egen command is a powerful tool for generating new variables derived from summary statistics like sum, mean, minimum, and maximum. It is particularly useful for calculations across several variables or within groups of observations.

Syntax: egen newvar = fcn(arguments)

Here fcn is an egen function, and the arguments can be an expression, a list of variables, or a list of numbers, depending on the function. Key functions include rowmin, which calculates the minimum value of the specified variables; rowmax, which finds the maximum; rowmean, which computes the average; rowmedian, for the median value; and rowtotal, which sums the values. Notably, these egen functions typically disregard missing values.

Example: To calculate the total maize output for Malawi, create a variable named nation_maize that sums the outputs of the three regions (north_maize, central_maize, and south_maize). The variables are listed with spaces, not commas, between them: egen nation_maize = rowtotal(north_maize central_maize south_maize)

Dialog: Data > Create or change data > Create new variable (extended)
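The sketch below contrasts egen's rowtotal() with a plain generate sum on a small made-up dataset (the values and the variable nation_plus are invented for illustration). rowtotal() takes a space-separated list of variables and treats a missing value as zero, whereas generate propagates it.

```stata
* A tiny made-up dataset: the second row has a missing central_maize
clear
input north_maize central_maize south_maize
10 20 30
5 . 15
end

* egen ignores the missing value when summing across the row
egen nation_maize = rowtotal(north_maize central_maize south_maize)

* generate returns missing as soon as one component is missing
generate nation_plus = north_maize + central_maize + south_maize

list
```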

Another command that is frequently used with the generate command is the replace command. replace modifies a variable by replacing some or all of its contents.

Syntax: replace var1 = a if var1==b where var1 is the variable to be modified; a is the replacement; and b is the content of var1 that is to be replaced

Example: Modify a variable called maize_output by replacing the missing values with 0 (in Stata, a single period stands for a missing value):

replace maize_output = 0 if maize_output==.

The contents of a variable can also be replaced with the contents of another variable or the results of an expression

Dialog: Data > Create or change data > Change contents of variable
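A minimal sketch of both uses of replace follows, on a made-up variable (the values and the variable maize_kg are invented for illustration).

```stata
* A tiny made-up dataset with one missing value
clear
input maize_output
12
.
7
end

* Replace missing values with 0; a single period denotes a missing value
replace maize_output = 0 if maize_output == .

* replace can also fill a variable with the result of an expression
generate maize_kg = maize_output
replace maize_kg = maize_output * 1000

list
```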

Changing string to numeric and vice versa

The destring and tostring commands change string variables to numeric variables, and numeric variables to string variables, respectively.

Syntax: 1) destring var1, generate(var2)

2) destring var1, replace

The first syntax converts the string variable var1 into a new numeric variable named var2. The second syntax converts var1 itself, replacing the string data with the corresponding numeric data.

To convert a string variable named `string_var` into a numeric variable, you can either create a new variable called `numeric_var` or overwrite `string_var` with numeric data. Use `destring string_var, generate(numeric_var)` to create a new numeric variable, or `destring string_var, replace` to convert the data within the same variable.

Note 1: The tostring command has the same syntax as the destring command.

Note 2: By default, destring converts a string variable only if it contains nothing but numeric data. If the variable includes letters or other nonnumeric characters, destring leaves it unchanged unless the ignore() or force option is specified.
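The sketch below shows a string variable that would not convert by default because one value contains a comma. The data and the name string_again are made up; ignore() is the destring option for stripping named characters before conversion.

```stata
* Hypothetical string variable; "1,050" contains a nonnumeric character
clear
input str6 string_var
"120"
"95"
"1,050"
end

* ignore(",") removes the comma so that all three values convert
destring string_var, generate(numeric_var) ignore(",")

* Convert the numeric variable back to a string variable
tostring numeric_var, generate(string_again)

list
```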

Merging Datasets

Merging Datasets for Latest Stata Versions (11 and above)

The merge command combines corresponding observations from the master dataset in memory with those from a second dataset, known as the using dataset, by matching on one or more key variables. It can perform several types of match: one-to-one, one-to-many, many-to-one, and many-to-many.

In Stata 11 and later versions, the syntax for performing a one-to-one merge on specified key variables is as follows: `merge 1:1 varlist using "location and name of second file", keepusing(vars)`.

Many-to-one merge on specified key variables: `merge m:1 varlist using "location and name of second file", keepusing(vars)`

One-to-many merge on specified key variables: `merge 1:m varlist using "location and name of second file", keepusing(vars)`

Many-to-many merge on specified key variables: `merge m:m varlist using "location and name of second file", keepusing(vars)`

The key variables used for matching are those named in varlist. To bring in only specific variables from the using dataset, use the keepusing option. If keepusing is omitted, all variables from the using dataset are merged by default.
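A minimal one-to-one merge can be sketched with two tiny invented datasets. The variable names hhid, income and yield are hypothetical, and a temporary file stands in for the "location and name of second file".

```stata
* Hypothetical using dataset: one row per household with income
clear
input hhid income
1 500
2 750
4 300
end
tempfile income
save `income'

* Hypothetical master dataset: one row per household with yield
clear
input hhid yield
1 1.2
2 0.8
3 1.5
end

* One-to-one merge on hhid, bringing in only the income variable
merge 1:1 hhid using `income', keepusing(income)
list
```

Households 1 and 2 appear in both files, household 3 only in the master file and household 4 only in the using file, so the merged dataset has four observations.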

Merging Data Sets for Older Stata Versions (10 and below but works for newer versions as well)

For Stata 10 and earlier versions, you still need to identify the master and using files. Unlike in later versions, there is no need to specify the 1:1 or m:1 relationship. Instead, the data must be sorted by the identifying variable (id) in both the master and using files before merging.

sort id
merge id using filename

The above syntax assumes you have already defined your working folder as described in section 2.5. Otherwise, if you haven’t defined your working folder, the syntax should be:

sort id
merge id using "location and name of second file"

The older syntax also works in newer versions (11 and up), but you will always get a notification that you are using an old syntax, e.g.:

(note: you are using old merge syntax; see [D] merge for new syntax)

Don't be alarmed if you encounter the above message; the merge still works correctly. Many users prefer the old syntax for its simplicity. Choose the method that feels most comfortable for you, as long as you follow the proper procedure.

After merging, it is always important to check how the files have been merged. For this purpose, Stata automatically generates a variable _merge (note the underscore). The _merge variable takes the values 1 to 3, where:

• 1 marks cases from the master file that were not matched (sometimes for a good reason, but always try to figure out why);

• 2 marks cases from the using file that were not matched (also possibly for a good reason, but always try to figure out why); and

• 3 marks cases that were successfully matched.

To check the merging outcome, tabulate (tab) the _merge variable as follows: tab _merge

Below is an example of an output for tab _merge:

merge houscode using env_forest_income_by_hh.dta

(note: you are using old merge syntax; see [D] merge for new syntax)

The above result shows that out of the 259 total observations, only 137 could be matched; the remaining 66 and 56 observations (137 + 66 + 56 = 259) were found in only one of the two files. Ideally, all observations should match. In our practical sessions, we will investigate the reasons behind the mismatches.

Finally, we need to drop the _merge variable; otherwise, no further merging into the same file can be done:

drop _merge
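The checking steps can be sketched end to end on invented data. Here the hypothetical variables district and rainfall are merged many-to-one onto household records, and district 3 deliberately has no rainfall record.

```stata
* Hypothetical district-level data (the using file)
clear
input district rainfall
1 900
2 1100
end
tempfile rain
save `rain'

* Hypothetical household-level data (the master file)
clear
input hhid district
1 1
2 1
3 2
4 3
end

merge m:1 district using `rain'

* Check how the merge went, then look at the unmatched cases
tab _merge
list if _merge != 3

* Drop _merge before any further merge
drop _merge
```

Listing the unmatched cases is usually the quickest way to see whether the mismatch is a coding problem in the key variable or a genuine gap in one of the files.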

Appending datasets

The append command appends a Stata dataset to the end of the dataset in memory.

Syntax: append using "location and name of second file"

Dialog: Data > Combine datasets > Append datasets

For a clean append, the two datasets should contain the same variables with identical names; a variable that exists in only one dataset is filled with missing values for the observations that come from the other. When dealing with categorical data, ensure that the categories are coded consistently across both datasets, for example 1 for "Agree," 2 for "Disagree," and 3 for "Don't know."
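As a sketch, two invented survey rounds with the same variable names can be stacked as follows. The years, household identifiers and yields are all hypothetical.

```stata
* Hypothetical 2013 survey round
clear
input hhid yield
1 1.2
2 0.8
end
generate year = 2013
tempfile round2013
save `round2013'

* Hypothetical 2014 survey round with the same variable names
clear
input hhid yield
1 1.4
2 0.9
end
generate year = 2014

* Add the 2013 observations below the 2014 observations
append using `round2013'
list
```

Adding an identifier such as year before appending keeps track of which file each observation came from.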

Collapsing Variables

This command converts the data in memory into a dataset of summary statistics, such as sums, means, medians, and so on.

Syntax: collapse (sum_stat) varname, by(categorical_var)

Here sum_stat is any statistic from the provided table, varname is the variable to be collapsed, and categorical_var is the categorical variable that defines the groups.

Example 1: Say you have a monthly dataset that you want to aggregate to annual data: collapse (sum) monthoutput, by(year)

Example 2: Say you have firm-level data that you want to aggregate to industry level: collapse (sum) firmoutput, by(industry)

Note that any variables in your dataset other than the by() variables and the variables you are collapsing will be dropped.

A key issue with collapse is its handling of missing values: it ignores them when computing summary statistics, so it is important to know how many non-missing observations were used to compute each average. One way to do this is to include an additional variable, equal to 1 for each non-missing observation, and collapse it with the rawsum statistic, which sums without applying weights. The collapsed variable then gives the number of observations used. The count statistic serves the same purpose.
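The missing-value point can be illustrated with a small invented monthly dataset. This sketch uses the count statistic rather than rawsum, and the output names mean_output and n_months are hypothetical.

```stata
* Hypothetical monthly output with one missing month in 2013
clear
input year month monthoutput
2013 1 10
2013 2 .
2013 3 14
2014 1 20
2014 2 22
end

* Annual mean plus the number of non-missing months behind each mean
collapse (mean) mean_output = monthoutput (count) n_months = monthoutput, by(year)
list
```

The 2013 mean is based on two months rather than three, which the n_months column makes visible.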

Keep and drop

To optimize your dataset in Stata, remove irrelevant variables and observations to conserve memory and to prevent them from affecting your analysis. You can tell Stata either to keep the variables you need or to drop those you don't, with the same outcome. For instance, the following two commands are alternative ways of discarding the unwanted variables:

keep yield fert seed labor hybrid

drop organic soil_good soil_fair

You can also drop or keep observations using the drop or keep command together with the if qualifier:

keep if yield >= 0

drop if yield < 0
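A minimal sketch of both uses on invented data, with variable names borrowed from the examples above:

```stata
* Hypothetical data with one impossible (negative) yield
clear
input yield fert organic
1.2 50 0
-1  40 1
0.8  0 1
end

* Keep only the variables needed for the analysis
keep yield fert

* Drop observations with negative yields
drop if yield < 0
list
```

Both commands change the data in memory only; the file on disk is untouched until the dataset is saved.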

When initially importing data into Stata, it's essential to review the dataset to ensure that all variables and observations are included and formatted correctly.

List

The browse and edit commands open a pop-up window for examining the raw data, while the list command shows the data in the Results window, which is practical only for small datasets. For larger datasets, various options can be used to make the list output more manageable.

The list command displays the values of variables. If no varlist is specified, the values of all the variables are displayed.

The varlist is the set of variables to be displayed, while options refers to any combination of the options available for the list command. The accompanying table outlines these options.

Gain practical experience in data management with the provided datasets, where you will learn to recode, rename, label, create, and collapse variables. You will also practice entering and editing data, as well as importing and exporting various data formats. Additionally, you will explore techniques for merging and appending datasets from different files.

Dialog: Data > Describe data > List data

Browse/Edit

The browse command is similar to edit, except that it does not allow the data to be modified in the grid. browse is a convenient alternative to the list command.

The edit command opens a spreadsheet-style data editor, enabling users to enter new information and modify existing data. Unlike other commands, it allows for manual adjustments to the dataset, providing greater flexibility in data management.

Edit using Data Editor: edit varlist, nolabel

Browse using Data Editor: browse varlist, nolabel

nolabel causes the underlying numeric values, rather than the value labels, to be displayed for variables with value labels.

When working with large datasets containing numerous variables, it is often necessary to focus on just a few at a time. Listing the variables of interest after the edit or browse command restricts the Data Editor to those variables.

Assert

When working with large datasets, checking every observation individually can be impractical. Stata offers several commands for examining data, one of which is the assert command, which verifies that a statement is true for every observation.

For example, you might want to check whether all values in the yield variable are nonnegative, as they should be: assert !(yield < 0) or assert yield >= 0

If the statement is true, assert produces no output on the screen. If it is false, assert gives an error message and the number of contradictions.
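The following sketch, on invented data, shows one way to use assert inside a do-file without stopping it at the first failure. The capture prefix and the _rc return code are standard Stata features; the data values are made up.

```stata
* Hypothetical yields, one of them negative
clear
input yield
1.2
0.8
-0.5
end

* capture keeps a failed assertion from halting the do-file;
* _rc is nonzero when the assertion was false
capture assert yield >= 0
display _rc

* Inspect the offending observations, remove them, then check again
list if yield < 0
drop if yield < 0
assert yield >= 0
```

Note that Stata treats missing values as larger than any number, so assert yield >= 0 does not flag missing yields.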

Describe

The describe command produces a summary of the dataset in memory or of the data stored in a Stata-format dataset.

Describe data in memory: describe varlist, memory_option

Describe data in file: describe varlist using "location and name of the file", file_options

Dialog: Data > Describe data > Describe data in memory or in a file

Codebook

The codebook command examines the variable names, labels, and data to produce a codebook describing the dataset.

codebook without a list of variables gives information on all variables in the dataset.

Dialog: Data > Describe data > Describe data contents (codebook)

Summarize

The summarize command provides summary statistics, such as means, standard deviations, minimums, and maximums.

Tabulate

The tabulate command is versatile: it can be used, for example, to produce a frequency table of one variable or a cross-tabulation of two variables.

Syntax: 1) tabulate varname, options

2) tabulate varname1 varname2, options

Dialog:

Statistics > Summaries, tables, and tests > Frequency tables > One-way table

Statistics > Summaries, tables, and tests > Frequency tables > Multiple one-way tables

Statistics > Summaries, tables, and tests > Frequency tables > Two-way table with measures of association

Statistics > Summaries, tables, and tests > Frequency tables > All possible two-way tables
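Both forms of tabulate can be tried on the auto dataset that ships with Stata. This is an illustrative sketch: the choice of variables and of the missing, row and chi2 options is an assumption, not part of the course material.

```stata
* Example dataset shipped with Stata
sysuse auto, clear

* One-way frequency table of repair record, showing missing values as a category
tabulate rep78, missing

* Two-way table of repair record by origin, with row percentages
* and a chi-squared test of independence
tabulate rep78 foreign, row chi2
```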

Inspect

The inspect command gives a quick overview of a numeric variable that differs from the summaries produced by summarize or tabulate. It reports the number of negative, zero, and positive values; the number of integers and non-integers; the number of unique values; and the number of missing values; and it draws a mini-histogram. This helps in spotting outliers and unusual values, such as non-integer values in variables expected to contain only integers. Its purpose is not analytical but is to allow you to quickly gain familiarity with unknown data.

Dialog: Data > Describe data > Inspect variables

Graph

The graph command draws graphs.

The table below provides the syntax for various graphs in Stata.

The “Graphics” tab on the menu bar can also be used.

Correlations

Correlation quantifies the relationship between variables. The correlate command produces a correlation or covariance matrix for a set of variables, while the pwcorr command displays all pairwise correlation coefficients among them.

correlate variable_list, correlate_options

pwcorr variable_list, pwcorr_options
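As a sketch, the two commands can be compared on the auto dataset that ships with Stata. The variables and options chosen here are illustrative assumptions only.

```stata
* Example dataset shipped with Stata
sysuse auto, clear

* Correlation matrix using only observations complete on all three variables
correlate price mpg weight

* Covariance matrix instead of correlations
correlate price mpg weight, covariance

* Pairwise correlations with observation counts, significance levels,
* and a star on coefficients significant at the 5% level
pwcorr price mpg rep78, obs sig star(0.05)
```

Because rep78 has missing values, pwcorr uses a different number of observations for different pairs, which the obs option makes visible.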

[Also refer to Appendix 6 for a summary of data manipulation techniques]

Hypothesis Testing

Each estimation command reports a two-sided t-test (for linear regressions) or z-test (for logit or probit regressions) of the null hypothesis that the true coefficient on each independent variable equals zero. You can also carry out an F-test or chi-squared test of joint hypotheses with the test command, run after an estimation such as: regress yield fert seed hybrid organic

To test whether the coefficients on seed, hybrid and organic are jointly equal to zero, use the following syntax: test seed hybrid organic

The test results will appear in the Results window.
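The joint test can be sketched on simulated data. The variable names follow the example above, but the data-generating values are invented for illustration. The second test shows how hypotheses about specific coefficient values can be tested jointly.

```stata
* Simulated data; the coefficients used to generate yield are made up
clear
set seed 12345
set obs 200
generate fert = 100 * runiform()
generate seed = 20 * runiform()
generate hybrid = runiform() < 0.5
generate organic = runiform() < 0.3
generate yield = 1 + 0.5*fert + 2.5*seed + 0.3*hybrid + rnormal()

regress yield fert seed hybrid organic

* Joint F-test that the three coefficients are all zero
test seed hybrid organic

* Joint test of two linear hypotheses about specific values
test (fert = 0.5) (seed = 2.5)
```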

You can also test linear hypotheses about the coefficients, for example that the coefficient on fert equals 0.5 and that the coefficient on seed equals 2.5.

The main pwcorr_options are:

obs: print the number of observations for each entry
sig: print the significance level for each entry
listwise: use listwise deletion to handle missing values
casewise: synonym for listwise
print(#): significance level for displaying coefficients
star(#): significance level for displaying coefficients with a star (*)
bonferroni: use the Bonferroni-adjusted significance level
sidak: use the Sidak-adjusted significance level

The main correlate_options are:

means: display means, standard deviations, minimums, and maximums with the matrix
noformat: ignore the display format associated with the variables
covariance: display covariances
wrap: allow wide matrices to wrap

This section provides a concise theoretical overview of econometric and statistical modeling, with additional material in Appendices 7a, 7b, and the final section of Appendix 8b. Readers are encouraged to refresh their knowledge of statistics and econometrics. For those without a background in these fields, efforts will be made to simplify the concepts, and enrolling in an introductory statistics or econometrics course in the future is recommended.

Estimation Procedure

Stata offers a wide range of regression commands, both general and specialized, and most of them share a similar syntax. This manual covers a select few in detail and only lists the others, which include analysis of variance and covariance (anova), censored-normal regression (cnreg), generalized method of moments estimation (gmm), the Heckman selection model (heckman), interval regression (intreg), instrumental variables regression (ivregress), regression with Newey-West standard errors (newey), Prais-Winsten regression (prais), quantile regression (qreg), ordinary least squares regression (reg), three-stage least squares regression (reg3), robust regression (rreg), seemingly unrelated regression (sureg), Tobit regression (tobit), the treatment effects model (treatreg), and truncated regression (truncreg).

PRACTICAL SESSION 3: Basic Data Analyses

In this session, you will gain practical experience with datasets, focusing on fundamental data analysis techniques. We will cover essential descriptive analyses, including creating graphs, hypothesis testing, and correlation analysis.

Stata also provides estimation commands for panel data, including the Arellano-Bond linear dynamic panel-data estimator (xtabond), interval regression models (xtintreg), fixed- and random-effects linear models (xtreg), fixed- and random-effects linear models with AR(1) disturbances (xtregar), and Tobit models (xttobit).

The dialog boxes for all regressions can be found by clicking Statistics on the menu bar. The regress dialog box, for instance, can be accessed as follows:

Statistics > Linear models and related > Linear regression

The figure below shows the regress dialog box

In the dialog box, users can specify the dependent and independent variables, along with if/in qualifiers and weights. A constant is included by default, but it can be suppressed by selecting the appropriate option. Standard errors are calculated under the assumption of homoscedasticity by default, but users can request robust (Huber-White) standard errors to allow for heteroscedasticity. The cluster option additionally accounts for correlation within clusters. Stata reports a 95% confidence interval by default, which can be changed with the level option.

Most estimation commands in Stata follow a similar syntax, but the available options vary, so it is advisable to consult the relevant help files. Stata applies default settings unless an option says otherwise, so it is worth knowing what those defaults are.
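The options described above can also be typed as commands. The sketch below uses the auto dataset shipped with Stata rather than the course data, and clustering on rep78 is only a stand-in to show the syntax, not a sensible modeling choice.

```stata
* Example dataset shipped with Stata
sysuse auto, clear

* Defaults: constant included, conventional standard errors, 95% interval
regress price mpg weight

* Heteroscedasticity-robust standard errors and a 90% confidence interval
regress price mpg weight, vce(robust) level(90)

* Standard errors clustered on repair record (a stand-in cluster variable)
regress price mpg weight, vce(cluster rep78)

* Suppress the constant and restrict the sample with an if qualifier
regress price mpg weight if foreign == 0, noconstant
```

Running the first two commands side by side shows that the coefficients are identical and only the standard errors and intervals change.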

Post-estimation

After running a regression, further evaluation is usually needed, including forecasting and hypothesis testing. Stata provides a variety of post-estimation commands for this purpose.

Prediction

After executing estimation commands like reg, cnsreg, logit, or probit, several predicted values can be derived, the most important being the predicted values of the dependent variable and the residuals. For instance, re-run the basic regression:

regress yield fert seed hybrid organic

predict y_yield : predicts the values of the dependent variable

predict r, residuals : predicts the residuals of the model (without the residuals option, predict returns the fitted values)

Stata stores the predictions in new variables, which can be used in other Stata commands. For instance, you can draw a histogram of the residuals to assess their normality.
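A minimal sketch of the prediction steps, using the auto dataset shipped with Stata in place of the course data. The variable names price_hat and r are arbitrary.

```stata
* Example dataset shipped with Stata
sysuse auto, clear
regress price mpg weight

* Fitted values (the default) and residuals (needs the residuals option)
predict price_hat
predict r, residuals

* Visual check of the residuals against a normal density
histogram r, normal
```

By construction, the observed value equals the fitted value plus the residual for every observation in the estimation sample.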

[Also refer to Appendices 8a and 8b for some details on Probit and Logit models]

The Centre for Agricultural Research and Development (CARD) at LUANAR is pleased to present this Manual as an introduction to Stata, marking the beginning of a comprehensive training series. Following this foundational course, we will offer Intermediate and Advanced Stata Analysis sessions. Future courses will integrate these analyses with practical applications in policy analysis, tailored to the needs of participants.

Once again, thank you for attending the course and/or using the Manual. You are welcome to make any suggestions and comments.

PRACTICAL SESSION 4: Regression Analysis – Linear Regression (OLS)

In this session, you will gain practical experience with Linear Regression Analysis using the provided datasets. We will cover pre-estimation checks, the estimation process, post-estimation evaluations, and making predictions based on the analysis.

PRACTICAL SESSION 5: Regression Analysis – Binary Logistic Regression

In this session, you will gain practical experience with Binary Logistic Regression Analysis using the provided datasets. We will cover pre-estimation checks, the estimation process, post-estimation analysis, and making predictions.

Appendix 1: Introduction to Policy Analysis

• A policy is a course of action or inaction chosen by public authorities to address a given problem or interrelated set of problems

– Increased competition to lower transport cost and improve quality of service (foreign vs local)

– Devaluation of the currency to stimulate exports

– Lifting the ban on maize exports

– Setting the minimum price of tobacco

– Bill proposing the minimum age for a girl to get married

– Amendment of the Constitution to restrict the incumbent VP from taking the high office of the President

– Implementation of the social cash transfer programme

– Gathering info on problems (causes & effects)

– Identifying several ways of dealing with the problem (alternatives)

– Assessing the likely results of those alternatives

• Involves both empirical and normative issues

• Involves both science and politics

• Utilizes many disciplines (e.g., economics, statistics, management science)

• Use of reason and evidence to choose course of action to attain a given set of goals/objectives;

• A body of concepts and principles aimed at helping the decision maker make choices intelligently, ethically and effectively

• Analyzes existing practices/policies for effectiveness

• Need for thoughtful, impartial assessment of problems and seek solutions

• Avoid “shooting from the hip” without knowing the underlying causes

• Anticipates potential outcomes and helps to plan mitigation actions

• Assessing impact (ex-post evaluation)

• Evaluating alternative policy prescriptions/ redesigning policies/programs

• Reducing uncertainty & providing information for decision makers in the public arena

• A systematic evaluation of the technical and economic feasibility and political viability of alternative policies, strategies for implementation, and consequences of policy adoption

1 What should be our goals?

2 Which option or option mix promises fewest negatives and greatest benefits?

1 Is the policy politically viable?

2 What variables are available to help ensure the successful implementation of the policy?

1 By what criteria can a policy be judged fair? Judged good?

– a subset of policy analysis methods comprising quickly applied but theoretically sound ways to aid in making policy decisions

– a subset of policy analysis methods requiring substantial budget, time and data to achieve results

– Search for truth and build theory about policy actions and effects

– May be too theoretical for most decision makers

– May take too much time for most decision makers

– Analyze alternatives to solve problems

– Goal is for practical value

– Research can be too narrow due to time or resource constraints

– Examples: Most academic research (e.g., impact of decentralization on public health service delivery)

– Advocate and support preferred policies

– Often ideological or partisan; may lack analytical depth

– Examples: Whether or not to continue hosting AU Summit

• Define and analyze the problem

– Who is affected and how seriously

– What are the causes & how did it develop

– What are the options for dealing with it

– Might be the most important step

Steps in Policy Analysis (cont’d)

– What is most suitable: effectiveness, efficiency, equity, political & social feasibility

– Will vary depending on the problem

– Which is likely to produce desired outcomes

– Which alternative is most desirable

– Single policy action, or combination

• Root causes vs pragmatic adjustments

– Should focus be on the underlying issues or on addressing the issue at hand

• Comprehensive vs short-term relevance

– Comprehensive is more thorough; better methodologically, but also takes more time

– Short-term may be less rigorous and raise quality concerns, but may also be more timely

What Analysis Is Needed? (cont’d)

– Should analysts do studies which closely adhere to mainstream values (norms), or

– Should they do studies which challenge those values

• Rational, technical analysis vs democratic politics

– Rational, technical analysis tends to focus on efficiency, and to rely on highly trained analysts

– Should citizen involvement also be considered when making decisions in a democratic state?

– Bias or funding source of the analyst

• Approach to problems that is logical, structured, valid, and replicable

• Generation of feasible courses of action

• A search for information and evidence of benefits and other consequences of courses of action

• In order to help policy makers choose the most advantageous policy action

– Projecting future states with and without policy or program

– Did policy or program achieve its objective?

– Was policy or program efficient, equitable, and politically acceptable?

– How best should it have been designed/implemented to achieve maximum impact?

Data Management Analyses Using Stata: An

• Stata is a powerful statistical package with:

– A wide range of up-to-date statistical techniques

– A graph-producing capability (arguably not the best at this)

– Fast and relatively easy to use

– Not an abbreviation but corruption of the word Statistics

• Numerous comparable packages on the market, e.g., SPSS, R, SAS, Matlab, Eviews, etc.

• So why Stata and not them?

– Capability to handle and manipulate large datasets (up to slightly over 2 billion observations!)

– Constantly being updated or advanced by users

– In short as one user summed it all:

• “When working with Stata it’s like you are talking to a very obedient person who does exactly what you want them to do.”

• Four different types (sizes) available for each version of Stata:

Type | Max. variables | Max. regressors | Max. observations | Notes

Stata/MP | 32,767 | 10,998 | 2,147,583,647* | Runs on multiple CPUs or cores (but also on a single core)

Stata/SE | 32,767 | 10,998 | 2,147,583,647* | Runs on a single core; can run on multi-core computers but uses only a single core

*Assuming you have enough memory

• Stata/SE Version 13 to be used in this course

• Stata is a command-driven package

• Also has pull-down menus as will be shown, but:

– Best way to learn Stata is through typing the commands

– For those interested in programming, typing makes it easy to switch into programming

– Arguably, you get the best out of Stata through typing the commands

What is this Course About?

• This course is designed to suit the needs of those who wish to acquire basic skills to analyze statistical data sets (for policy analysis) and produce technical reports

• After completion of this course the delegate should be able to perform data entry, manipulation and analysis in STATA as outlined in the program

• Stata is one of the big three general purpose statistical programs

• Version 1 was released in 1985; the current version is 13

• It offers intuitive data management and manipulation and a wide range of statistics

• Fully programmable, publication-quality graphics

• Inexpensive (cf. SPSS, SAS) and widely used

• Stata provides easy access to important analyses that are not available in many standard statistical packages

1 Panel / cross-sectional time series data

• Stata includes commands for tobit regression (tobit), heckman selection models (heckman, heckprob)

• Specialized statistical capabilities at no additional cost (e.g., xt: panel/cross-sectional time series, svy: survey, st: survival, robust standard errors, etc.)

• Handles complicated data collected from complicated designs

• Stata is an excellent tool for data manipulation:

Transferring data from external sources into Stata involves several key steps, including data cleaning, creating new variables, and generating summary datasets. It is also important to check for errors when merging datasets. Finally, datasets can be reshaped between "long" and "wide" formats to suit the analysis.

• Stata provides all of the standard univariate, bivariate and multivariate statistical tools, from descriptive statistics and t-tests through one-, two- and N-way ANOVA, regression, principal components, and the like

• Stata has a very powerful set of techniques for the analysis of limited dependent variables:

– logit, probit, ordered logit and probit, multinomial logit, etc

– Stata’s regression capabilities are full-featured,

• regression diagnostics, prediction, robust estimation of standard errors

• instrumental variables / two-stage least squares,

• seemingly unrelated regressions / three-stage least squares, etc

• Stata graphics have been extensively improved and enhanced

• Graphics are excellent tools for exploratory data analysis,

• Produce high–quality 2-D publication-quality graphics, in a variety of

• Stata is very well supported by telephone and email technical support, as well as the more informal support provided by other users on StataList, the listserv

• The manuals are useful–particularly the User’s Guide

• Full details of the command syntax are available online in the windowed “Help Viewer” in hypertext form

• Even in the command–line environment, the full help files are available

• STATA runs interactively with short and simple commands, making it relatively easy to learn

• STATA allocates a default amount of memory into which it loads a copy of the input data set- very fast to produce results

• Stata is started by double-clicking on the STATA icon

• Command Window: you enter your commands in this window

• Review Window records your commands

• Results window displays your output

– The contents of the Results window can be kept in a log, which can be named, saved, and reused later

• Variables window lists the variables in the data set you are using

• The display colours may be adjusted by going into the Edit menu, clicking on general preferences, and selecting the background colours for the different windows

• In the header bar at the top of the screen is a list of topics:

• Version 10: File, Edit, Data, Graphics, Statistics, User, Window, and Help

• The Help option in the Header bar provides a Contents option and a Search Option

• The ‘Contents’ option can be used by beginners unfamiliar with STATA commands

• The ‘Search’ option can be used by users who know the name of the command or topic they wish to search

• The STATA web site, at http://www.stata.com ,

– You can find many useful links, resources,

– Available publications on how to use STATA

– Archive of solutions to problems (the Statalist listserv)

[prefix:] command [varlist] [=exp] [if] [in] [weight] [using filename] [, options]

• [prefix:]: Some commands precede a Stata command and modify its behaviour

– e.g., regression by EPA: by EPA: reg depvar indepvars

• command: Tell STATA what it should do for you

• [Varlist]: List of one or many variables

– sum age landsize income sex

• [=exp]: Used in commands where an algebraic expression produces a new variable or updates an existing one (DATA MANIPULATION)

• [if] & [in]: Conditions and ranges- choose data to do analysis on

• [using]: Analysis to be done on data saved with a file name

• [, options]: These are specified depending on the type of analysis

– e.g., estimation of robust standard errors instead of the conventional standard errors

• GOOD NEWS: All Stata commands have help files

• The help command shows you how to use the command as well as the options associated with a particular command

– help mean; help heckman; help graph

• Use the search command for any type of analysis or procedure

– search regression; search tobit; search switching regression

Introduction to Stata Interface and Syntaxes

Parts of the STATA Interface

1 Menu bar (pull-down menu)

Stata Interface: Windows

Window: Function

Variables: The Variables window shows the list of variables in the dataset, along with selected properties of the variables

Properties: The Properties window displays variable and dataset properties

Command: Commands are submitted to Stata from the Command window

Results: The Results window contains all the commands you have entered during the Stata session and their textual results

Review: The Review window shows the history of commands that have been entered

• Provides another way of telling STATA to do what you want

• Click on each on of them for the pull down manus

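The Stata Syntax – The Language of Stata

• Example 1: descriptive statistics (mean, min, max, etc.) of yield, fert and seed for the three regions of Malawi

Syntax: by region: summarize yield fert seed if year == 10, detail

– Prefix: by region: (the data must already be sorted by region; bysort region: sorts first)

– Command: summarize

– Varlist: yield fert seed

– If: if year == 10 (equality is tested with ==, not =)

– Option: detail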
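The same prefix, command, varlist, if and option pattern can be run on the auto dataset installed with Stata. This is a sketch only: the yield data from the example are not available here, so foreign, price, mpg, weight and rep78 stand in for region, yield, fert, seed and year.

```stata
* Load the example dataset installed with Stata
sysuse auto, clear

* bysort sorts by foreign and then runs summarize once per group.
* Missing values count as larger than any number in Stata, so they
* are excluded explicitly from the if condition.
bysort foreign: summarize price mpg weight if rep78 >= 3 & !missing(rep78), detail
```

Using bysort rather than by avoids the "not sorted" error that by raises when the data have not been sorted on the grouping variable.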

Appendix 3: Data Entry, Import and Export

Data Entry, Importing and Exporting

1 Double-click the Stata data file to load it into Stata

2 From the Stata interface, click on “File” and then on “Open”

3 You can also use the “Open” icon on the menu bar to do the same

• Other formats (Excel, SAS, etc.)

Click on “File” on the menu bar. In the File drop-down menu, click on “Import” and then choose the format of your dataset. The import dialog box will appear. Choose the options in the dialog box as appropriate and click on “OK”.

• Using the Command window:

a Stata file: use command

Syntax: use "location of file\filename", clear

Example: use "C:\Users\user\Desktop\responserate.dta", clear

b Excel file: import command

Syntax: import excel "file location\filename.xls", sheet("Sheet1") firstrow

Example: import excel "C:\Users\user\Desktop\responserate.xls", sheet("Sheet1") firstrow

c CSV file: insheet command

Syntax: insheet using "file location\filename.csv"

Example: insheet using "C:\Users\user\Desktop\responserate.csv"

d SPSS file: usespss command

Install: ssc install usespss (installs usespss)

Syntax: usespss using "location of file\filename.sav"

Example: usespss using "C:\Users\user\Desktop\responserate.sav"

1 Open Stata using the Start button or the icon on the desktop

2 Click on the “Data Editor” on the toolbar

3 Enter the data; variable names appear as var1, var2, var3, …

String data appear in red

• Saving in Stata format for the first time:

1 Select File > Save As and use the Save As dialog box to complete the saving; OR

2 Type save filename in the Command window

• Saving modified data in Stata format:

2 Click on the Save button; OR

3 Type save, replace in the Command window

To save in another format, select File > Export, then select the format and use the dialog box to complete saving
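The saving and exporting steps can also be scripted. The sketch below is an illustration under assumptions: the file names auto_copy.dta and auto_copy.xls and the variable price_thousands are placeholders, and the files are written to the current working directory.

```stata
* Load the example dataset installed with Stata
sysuse auto, clear

* Save in Stata format (replace is added so the sketch can be re-run)
save "auto_copy.dta", replace

* Modify the data, then save again under the same name
generate price_thousands = price / 1000
save, replace

* Export to Excel, writing variable names in the first row
export excel using "auto_copy.xls", firstrow(variables) replace

* Read the Excel file back in; firstrow treats row 1 as variable names
import excel "auto_copy.xls", sheet("Sheet1") firstrow clear
describe
```

After the first save, save, replace needs no file name because Stata remembers the file the data were last saved to.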

Appendix 4: Data Documentation (Log File)

Stata Documentation: Creating a Log File

• A log file is simply a record of your Results window. It records all commands and all textual output as it happens

• All the output that appears in the Results window can be captured in a log file

• It is recommended that you start a log file whenever you begin any serious work in Stata

• Stata can save the file in one of two different formats

1 Stata Markup and Control Language (SMCL) format (default)

• SMCL format is recommended because SMCL files can be translated into a variety of formats readable by applications other than Stata

2 Plain text format

Using a log file:

1 Start the log: File > Log > Begin

2 Issue commands to get output

3 Suspend the log: File > Log > Suspend

4 Resume the log: File > Log > Resume

• If you choose an existing log file, you will be asked whether you want to view, append to or overwrite the existing log file

Viewing and translating Log files

• You can view the log file using the Viewer window in two ways:

You can convert from SMCL format to other formats:
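The logging steps in this appendix also have command equivalents. The following is a sketch, assuming the placeholder file names mylog.smcl and mylog.log in the current working directory; it is best run as a do-file.

```stata
* Start a log in the default SMCL format (file name is a placeholder)
log using "mylog.smcl", replace

sysuse auto, clear
summarize price mpg          // recorded in the log

* Suspend logging: output from describe is not recorded
log off
describe

* Resume logging
log on
tabulate foreign

* Close the log when the work is finished
log close

* Convert the SMCL log to plain text
translate "mylog.smcl" "mylog.log", replace

* To read the log in the Viewer window (interactive use only):
* view "mylog.smcl"
```

The translate command picks the conversion from the two file extensions, so a .smcl input and a .log output produce a plain-text copy.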

Merging and Appending Datasets in Stata

Merging

The merge command joins corresponding observations from the dataset currently in memory (called the master dataset) with those from a second dataset (called the using dataset)

Types of merge:

1 One-to-one: The key variables uniquely identify observations in both the master and the using datasets

2 One-to-many: The key variables uniquely identify observations in the master dataset, but are duplicated in the using dataset

3 Many-to-one: The key variables are duplicated in the master dataset, but uniquely identify observations in the using dataset

• One-to-one: merge 1:1 varlist using "location and name of second file", keepusing(vars)

• One-to-many: merge 1:m varlist using "location and name of second file", keepusing(vars)

• Many-to-one: merge m:1 varlist using "location and name of second file", keepusing(vars)

• Using the pull-down menu: Data > Combine datasets > Merge two datasets

Appending

Adding a Stata dataset to the end of the dataset in memory

• Syntax: append using "location and name of second file"

• Using the pull-down menu: Data > Combine datasets > Append datasets
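A self-contained way to practise merge and append is to split the auto dataset into pieces and put them back together. This is only a sketch: the temporary file names specs and foreigncars are made up, and it should be run as a do-file because it relies on local macros.

```stata
sysuse auto, clear

* Build a using dataset that shares the key variable make
preserve
keep make mpg weight
tempfile specs
save `specs'
restore

* The master dataset keeps the key plus other variables
keep make price foreign

* 1:1 merge: make is unique in both datasets.
* keepusing(mpg) brings in mpg only, leaving weight behind.
merge 1:1 make using `specs', keepusing(mpg)
tabulate _merge              // 3 = matched in both datasets
drop _merge

* Append: store the foreign cars, keep the domestic ones,
* then add the foreign cars to the end of the data in memory
preserve
keep if foreign == 1
tempfile foreigncars
save `foreigncars'
restore
keep if foreign == 0
append using `foreigncars'
count
```

Checking _merge after every merge is a useful habit, since unmatched observations (values 1 and 2) often point to problems in the key variable.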

Exercise 1: Exporting and Importing data

1 Download the auto.dta and auto2.dta datasets

File > Example Datasets… > Example datasets installed with Stata > click once on “use” beside auto.dta > save the file as auto.dta

2 Download auto2.dta using the same procedure as in 1

3 Load auto.dta into Stata (using a do-file)

Command: use "C:\Users\user\Desktop\auto.dta", clear

4 Export auto.dta to Excel and then save the Excel file as auto.xls

5 Import auto.xls into Stata

Command: import excel "C:\Users\user\Desktop\auto.xls", sheet("Sheet1") firstrow

• 1 Append the auto and auto2 datasets

Command: append using "location\auto2.dta"

• 2 Merge HouseholdGeovariables_IHPS.dta and Household\HH_MOD_A_FILT using 1:1 merging

Command: merge 1:1 y2_hhid using "location\HouseholdGeovariables_IHPS.dta"

• 3 Merge HouseholdGeovariables_IHPS.dta and PlotGeovariables_IHPS.dta using 1:m merging

Command: merge 1:m y2_hhid using "location\PlotGeovariables_IHPS.dta"

• 4 Merge PlotGeovariables_IHPS.dta and HouseholdGeovariables_IHPS.dta using m:1 merging

Command: merge m:1 y2_hhid using "location\HouseholdGeovariables_IHPS.dta"

Introduction to Stata Documentation & Data

Prefix: bysort foreign: summarize price; //summarizes price separately for domestic and foreign vehicles

Command: sum price displacement gear_ratio length; //gives summary statistics for the variables

*Use of if, equality and &: sum price if foreign==0 & price>1000; //summarizes price for domestic cars priced above US$1000

Documentation in Stata

version 13

clear

#delimit ; /* from this point forward, each line ends with a semicolon */

set more off, permanently; /* applies "set more off" permanently */

use "D:\Director of Research and Outreach\Seminars znd Workshops\Stata\auto.dta"; /* open auto data saved on the computer */

log using "D:\Director of Research and Outreach\Seminars znd Workshops\Stata\day5.log", replace; //file that saves the results

log close;

Exploring data involves several steps that reveal its structure and details. Begin by describing the data in memory with `describe` and consulting the codebook with `codebook`. The codebook can be limited to specific variables such as price and foreign. To look at the values themselves, list selected variables such as mpg, length, and rep78. Use the `inspect` command for a quick overview of how values are distributed, followed by `summarize` for summary statistics across all variables. For detailed statistics on specific variables such as mpg, length, trunk, and rep78, use `summarize` with the detail option. The `browse` command lets you view the data in a spreadsheet format. You can also filter the dataset, for example to list the vehicle makes that have a repair record of 5 or a fuel consumption greater than 25 mpg.
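The exploration steps described above translate into the following commands. This is a sketch on the auto dataset installed with Stata; the in 1/10 range and the single variable passed to inspect are additions made here to keep the output short.

```stata
sysuse auto, clear

* Structure of the data in memory
describe

* Codebook entries for specific variables
codebook price foreign

* List selected variables (first 10 observations only)
list mpg length rep78 in 1/10

* Quick overview of one variable's values
inspect mpg

* Summary statistics for all variables, then detailed ones
summarize
summarize mpg length trunk rep78, detail

* Spreadsheet view (interactive sessions only):
* browse

* Makes with a repair record of 5 or more than 25 mpg
list make rep78 mpg if rep78 == 5 | mpg > 25
```

The vertical bar | means "or", so the last command lists cars that meet either condition.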

Statistical tables can be generated with several commands. A two-way frequency table of vehicle source against repair record is produced with `tab foreign rep78`. Vehicle prices are summarized by source with `tab foreign, summarize(price)`. The cross-tabulation `tab foreign rep78, chi2 row col` adds a chi-square test and row and column percentages. One-way frequency tables for several variables at once are created with `tab1 foreign price mpg weight`, and `tab2 mpg foreign` tabulates fuel consumption against the vehicle's source. The command `tabstat price mpg weight, statistics(mean min range max skewness)` provides descriptive statistics for these variables, while `tabstat price weight mpg rep78, by(foreign)` separates the statistics for imported and locally manufactured vehicles. The mean price for observations 20 to 70 is calculated with `tabstat price in 20/70`. Finally, `tabstat price weight mpg rep78, by(foreign) stat(mean sd min max) nototal long col(stat)` lays the statistics out in columns for better readability.
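Tabulation commands also leave their results behind in r(), which is handy for checking how many observations actually entered a table. The sketch below is an illustration on the auto dataset; the display lines are additions and not part of the original manual.

```stata
sysuse auto, clear

* Cross-tabulation with a chi-square test and row/column percentages
tabulate foreign rep78, chi2 row col

* Stored results from the two-way table. Observations with a
* missing rep78 are dropped, so r(N) is smaller than _N.
display "Observations in memory:  " _N
display "Observations tabulated:  " r(N)
display "Pearson chi2:            " r(chi2)
display "p-value:                 " r(p)

* Statistics by vehicle source, with one column per statistic
tabstat price weight mpg rep78, by(foreign) statistics(mean sd min max) nototal long columns(statistics)
```

Comparing r(N) with _N is a quick way to notice that missing values have reduced the sample behind a table.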

To analyze vehicle pricing data, various graphs can be generated. A scatter plot of miles per gallon (mpg) against price is created with `scatter mpg price`. To identify outliers, a graph matrix of mpg, price, weight, and length can be used. For domestic vehicles, a histogram of price is drawn with `hist price if foreign==0`, while a comparative histogram by vehicle source is created with `hist price, by(foreign)`. A matrix of scatter plots for price, weight, mpg, and repair record is generated with `graph matrix price weight mpg rep78`, and the lower half only is displayed with `graph matrix price weight mpg rep78, half`. A box plot of price by vehicle source is produced with `graph box price, over(foreign)`, and a titled scatter plot of price against weight with `twoway scatter price weight, title(Plot of Price vs Weight)`.

Arithmetic: * multiplication; / division

Logical: ! not; ~ not

Relational (numeric and string): >= greater than or equal
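The operators listed above can be tried directly on the auto dataset. In this sketch the variable names price_thousands and weight_kg and the conversion factor 0.4536 (pounds to kilograms) are illustrative additions.

```stata
sysuse auto, clear

* Arithmetic: division and multiplication
generate price_thousands = price / 1000
generate weight_kg = weight * 0.4536

* Logical "not": ! and ~ are interchangeable
count if !foreign            // domestic cars
count if foreign ~= 1        // same condition, written with ~

* Relational: greater than or equal
count if mpg >= 20
```

Both count commands in the middle return the same number, since foreign takes only the values 0 and 1 in this dataset.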

Title	Stata Training Manual 1
Authors	Charles Jumbe, PhD, Francis Darko, MSc, Thabbie Chilongo, PhD
Affiliations	Associate Professor and Director of Research and Outreach, Lilongwe University of Agriculture and Natural Resources (LUANAR), Bunda Campus, Malawi; PhD Scholar, Purdue University, USA; Research Fellow, Centre for Agricultural Research and Development (CARD), LUANAR, Bunda Campus, Malawi
Institution	Lilongwe University of Agriculture and Natural Resources
Field	Agricultural Research
Type	training manual
Year of publication	2014
Location	Malawi
